Using AWS Managed Services, creating a cloud-based OLAP Cube and ETL framework

OLAP Cube and ETL framework with AWS Managed Services

For decades, businesses have relied on online analytical processing (OLAP) workloads to answer difficult business problems by filtering and consolidating data. These complicated searches took a lot of time to calculate and memory to store. To model and organise data, teams had to create and manage complicated extract, transform, and load (ETL) pipelines, frequently using commercial-grade analytics tools.

In this post, we'll go through how to create a cloud-based OLAP cube and ETL architecture that produces faster results at cheaper prices without losing performance by:

Using the cloud to connect your on-premises database for relevant to the development, transformation and discovery.
OLAP workloads may be run without the requirement for expensive third-party software licencing, specific infrastructure, or data migration.
Machine learning is used to categorise and display data using AWS Glue Data Catalog, Amazon QuickSight, Amazon Athena, and Amazon SageMaker (ML).

AWS Managed Services data analytics pipeline

AWS Managed Services are used in the suggested architecture in Figure 1. AWS Glue DataBrew is a no-code data transformation solution that lets you create transformation tasks easily. AWS Glue crawlers extract metadata from converted data and categorise it for Athena and QuickSight analytics and visualisation. SageMaker will create, train, and deploy machine learning models.

This strategy will assist you in getting answers from your collection to your consumers as quickly as possible without requiring you to convert your data to AWS. Because no code is required, you can easily use data transformation, cataloguing, analytics, and machine learning.

Figure: 1

AWS Managed Services Benefits for Data Analytics

On-demand access to on-premises database

The online transaction processing (OLTP) database in your company data centre is the starting point for the sample architecture in Figure 1. Figure 2 illustrates how to connect an OLTP database to DataBrew on AWS to perform OLAP workloads using a Java database connectivity (JDBC) connection. For typical data stores including MySQL, Microsoft SQL Server, Oracle, and DataBrew supports JDBC data sources, PostgreSQL.

Figure: 2

Data discovery is done automatically.

DataBrew summarises your data for discovery in Figures 3 through 6. You may profile your data to see if there are any trends or abnormalities. You may also use over 250 built-in transforms to perform transformations called "jobs" in DataBrew without writing any code.

Figure: 3

Figure: 4

Figure: 5

Transformation of data and categorising without the use of codes

You may build jobs based on the transformation stages illustrated in Figure 6 to conduct OLAP-type transactions. DataBrew recipe refer to all of these procedures taken together. These recipe results can be saved to an Amazon Simple Storage Services (Amazon S3) bucket by running them as a job.

Figure: 6

Scheduled DataBrew tasks work in a similar way as OLAP's scheduled ETL pipelines. DataBrew may conduct a task on a recurrent basis based on data refresh and business requirements (for example, every 11 hours). This can be scheduled at a certain time of day or according to a valid CRONs phrase. This makes it easier to automate your transformation processes.

The OLAP catalogue is a collection of data that stands between the OLAP data and the programs that use it. You may use AWS Glue crawlers to automatically categorise your data and establish its format, schema, and related characteristics to generate a Data Catalog. Figure 7 depicts the results of a crawler's findings published to Data Catalog as metadata to assist data consumers in finding the information they require.

Figure: 7

Data management without the use of software from third party

By using Athena to refer to the metadata definitions in the Data Catalog as references to the real data in Amazon S3, you may conduct analytics on your data. Without having to transfer data around, Athena is ideally suited for executing one-time queries using conventional SQL to query the converted data immediately in Amazon S3. Because Athena is serverless, you don't have to worry about maintaining infrastructure, and you just pay for the queries you execute.

Separate visualisation and management intelligence (MI) technologies are frequently used to enhance OLAP workloads. These programmes frequently have their own licencing, server administration, and security requirements.

A scalable, QuickSight, virtualization, embeddable, ML-powered of ML BI solution, may be used to view curated data. As demonstrated in Figure 8, QuickSight makes it simple to design and publish immersive BI dashboards that feature ML-powered insights. These dashboards may be shared with several other users and integrated into your own software.

Figure: 8

Finally, SageMaker allows you to combine machine learning and OLAP workloads. Previously, machine learning workloads were frequently costly, resource-intensive, and unavailable. SageMaker is a fully managed machine learning service that allows users to rapidly and simply design, train, and deploy model of ML into a performance hosted environment.

Conclusion

In this article, we'll teach you how to use a JDBC connection to connect your on-premises database to DataBrew for data profiling, discovery, and transformation. We looked at how DataBrew recipes and processes may be used to execute OLAP workloads without the need for expensive third-party software licencing, specialised infrastructure, or data migration. Without needing to maintain any servers, we looked at AWS capabilities in data cataloguing, visualisation, and Data Catalog through machine learning, Athena, QuickSight, and SageMaker.

Author: BIKKI SHAW

Updates on latest tech and news

Using AWS Managed Services, creating a cloud-based OLAP Cube and ETL framework

OLAP Cube and ETL framework with AWS Managed Services

AWS Managed Services data analytics pipeline