Introducing Amazon Redshift

Amazon Redshift is a fast, fully managed, petabyte-scale data warehousing solution that makes it simple and cost-effective to analyze large volumes of data using existing business intelligence (BI) tools. With Amazon Redshift, you can get the performance of columnar data warehousing engines that perform massively parallel processing (MPP) at a tenth of the cost. You can start small for $0.25 per hour, with no commitments, and scale to petabytes for $1,000 per terabyte per year. You can grow to exabyte-scale storage by storing data in an Amazon Simple Storage Service (Amazon S3) data lake and taking a lake house approach to data warehousing with the Amazon Redshift Spectrum feature. With this setup, you can query data directly from files on Amazon S3 for as low as $5 per terabyte of data scanned.
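As a concrete illustration, the following sketch submits a Redshift Spectrum query over S3-resident data through the Redshift Data API. The cluster identifier, database, user, and external schema and table names (spectrum_schema.clickstream) are placeholders, and the external schema is assumed to have been created over the S3 files beforehand.

```python
# Minimal sketch: run a Redshift Spectrum query against S3-backed data
# through the Redshift Data API. Cluster, database, and schema/table
# names are placeholders for illustration.
import boto3

client = boto3.client("redshift-data")

response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",    # hypothetical cluster name
    Database="dev",
    DbUser="awsuser",
    Sql="""
        SELECT event_date, COUNT(*) AS events
        FROM spectrum_schema.clickstream        -- external table over S3 files
        GROUP BY event_date
        ORDER BY event_date;
    """,
)
print("Statement submitted:", response["Id"])
```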


Modern analytics and data warehousing architecture

Data typically flows into a data warehouse from transactional systems and other relational databases, and typically includes structured, semi-structured, and unstructured data. This data is processed, transformed, and ingested at a regular cadence. Users, including data scientists, business analysts, and decision-makers, access the data through BI tools, SQL clients, and other tools.


So why build a data warehouse at all? Why not just run analytics queries directly on an online transaction processing (OLTP) database, where the transactions are recorded? To answer the question, let’s look at the differences between data warehouses and OLTP databases.

  • Data warehouses are optimized for batched write operations and reading high volumes of data.
  • OLTP databases are optimized for continuous write operations and high volumes of small read operations.

Data warehouses generally employ denormalized schemas like the Star schema and Snowflake schema because of high data throughput requirements, whereas OLTP databases employ highly normalized schemas, which are more suited for high transaction throughput requirements.

To get the benefits of using a data warehouse managed as a separate data store with your source OLTP or other source system, we recommend that you build an efficient data pipeline. Such a pipeline extracts the data from the source system, converts it into a schema suitable for data warehousing, and then loads it into the data warehouse. In the next section, we discuss the building blocks of an analytics pipeline and the different AWS services you can use to architect the pipeline.


AWS analytics services help enterprises quickly convert their data to answers by providing mature and integrated analytics services, ranging from cloud data warehouses to serverless data lakes. Getting answers quickly means less time building plumbing and configuring cloud analytics services to work together. AWS helps you do exactly that by giving you:

  • An easy path to build data lakes and data warehouses and start running diverse analytics workloads.
  • A secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads.
  • A fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open file formats, standard SQL language, open-source engines, and platforms.
  • The best performance, the most scalability, and the lowest cost for analytics.

AWS is focused on helping customers build and secure data lakes and data warehouses in the cloud within days, not months. AWS Lake Formation enables secured, self-service discovery and access for users. Lake Formation provides easy, on-demand access to specific resources that fit the requirements of each analytics workload. The data is curated and cataloged, already prepared for any type of analytics. Related records are matched and de-duplicated with machine learning.
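For example, a minimal sketch of granting self-service access with Lake Formation might look like the following; the IAM role ARN, database, and table names are placeholders.

```python
# Minimal sketch: grant an analyst role SELECT access to one cataloged table
# with AWS Lake Formation. The role ARN, database, and table names are
# placeholders for illustration.
import boto3

lf = boto3.client("lakeformation")

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_lake", "Name": "orders"}},
    Permissions=["SELECT"],
)
```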


AWS analytics services

AWS provides a diverse set of analytics services that are deeply integrated with the infrastructure layers. This enables you to take advantage of features like intelligent tiering and Amazon Elastic Compute Cloud (Amazon EC2) spot instances, to reduce cost and run analytics faster. 

When you’re ready for more advanced analytic approaches, use our broad collection of machine learning (ML) and artificial intelligence (AI) services against that same data in S3 to gain even more insight without the delays and costs of moving or transforming your data.


Analytics architecture

Analytics pipelines are designed to handle large volumes of incoming streams of data from heterogeneous sources such as databases, applications, and devices.

A typical analytics pipeline has the following stages:

  1. Collect data
  2. Store the data
  3. Process the data
  4. Analyze and visualize the data


Data collection

At the data collection stage, consider that you probably have different types of data, such as transactional data, log data, streaming data, and Internet of Things (IoT) data. AWS provides collection and storage solutions for each of these data types.


Log data

Reliably capturing system-generated logs helps you troubleshoot issues, conduct audits, and perform analytics using the information stored in the logs. Amazon S3 is a popular storage solution for non-transactional data, such as log data, that is used for analytics. Because it provides 99.999999999 percent durability, S3 is also a popular archival solution.
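A minimal sketch of shipping a log file into S3 with the AWS SDK for Python (Boto3) might look like this; the file path, bucket name, and key layout are placeholders.

```python
# Minimal sketch: ship an application log file to Amazon S3 for later
# analysis. The file path, bucket, and key layout are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="/var/log/app/app-2024-01-01.log",   # local log file (example path)
    Bucket="my-analytics-logs",                   # hypothetical bucket
    Key="raw/app/2024/01/01/app.log",             # date-partitioned key layout
)
```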


Streaming data

Web applications, mobile devices, and many software applications and services can generate staggering amounts of streaming data—sometimes terabytes per hour—that need to be collected, stored, and processed continuously. Using Amazon Kinesis services, you can do that simply and at a low cost. Alternatively, you can use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to run applications that use Apache Kafka to process streaming data. With Amazon MSK, you can use native Apache Kafka application programming interfaces (APIs) to populate data lakes, stream changes to and from databases, and power ML and analytics applications.
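For example, a producer can push individual events into a Kinesis data stream with a few lines of Boto3; the stream name and event payload below are placeholders.

```python
# Minimal sketch: push one clickstream event into a Kinesis data stream.
# The stream name and event payload are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "action": "add_to_cart", "ts": "2024-01-01T12:00:00Z"}

kinesis.put_record(
    StreamName="clickstream",                  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],             # controls shard assignment
)
```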


IoT data

Devices and sensors around the world send messages continuously. Enterprises today need to capture this data and derive intelligence from it. Using AWS IoT, connected devices interact easily and securely with the AWS Cloud. Use AWS IoT to leverage AWS services such as AWS Lambda, Amazon Kinesis services, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB to build applications that gather, process, analyze, and act on IoT data, without having to manage any infrastructure.
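As a sketch, a backend process can publish a sensor reading to an AWS IoT topic through the IoT data plane API (actual devices more commonly use the AWS IoT Device SDK over MQTT); the topic and payload are placeholders.

```python
# Minimal sketch: publish a sensor reading to an AWS IoT topic using the
# AWS IoT Data Plane API. The topic and payload are placeholders.
import json
import boto3

iot = boto3.client("iot-data")

iot.publish(
    topic="factory/line1/temperature",         # hypothetical MQTT topic
    qos=1,
    payload=json.dumps({"sensor_id": "t-42", "celsius": 21.7}),
)
```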


Data processing

The collection process provides data that potentially has useful information. You can analyze the extracted information for intelligence that will help you grow your business. This intelligence might, for example, tell you about your user behavior and the relative popularity of your products. The best practice to gather this intelligence is to load your raw data into a data warehouse to perform further analysis.


There are two types of processing workflows to accomplish this: batch processing and real-time processing. The most common forms of processing, online analytical processing (OLAP) and OLTP, each use one of these types.

OLAP processing is generally batch-based. OLTP systems are oriented toward real-time processing and are generally not well suited for batch-based processing. If you decouple data processing from your OLTP system, you keep the data processing from affecting your OLTP workload.


Batch processing

  • Extract Transform Load (ETL) — ETL is the process of pulling data from multiple sources to load into data warehousing systems. ETL is normally a continuous, ongoing process with a well-defined workflow. During this process, data is initially extracted from one or more sources. The extracted data is then cleansed, enriched, transformed, and loaded into a data warehouse. For batch ETL, use AWS Glue or Amazon EMR. AWS Glue is a fully managed ETL service. You can create and run an ETL job with a few clicks in the AWS Management Console. Amazon EMR is for big data processing and analysis. EMR offers an expandable, low-configuration service as an easier alternative to running in-house cluster computing.


  • Extract Load Transform (ELT) — ELT is a variant of ETL, where the extracted data is loaded into the target system first. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when your target system is powerful enough to handle transformations. Amazon Redshift is often used in ELT pipelines because it is highly efficient in performing transformations (a minimal sketch follows this list).


  • Online Analytical Processing (OLAP) — OLAP systems store aggregated historical data in multidimensional schemas. Used widely for query, reporting, and analytics, OLAP systems enable you to extract data and spot trends on multiple dimensions. Because it is optimized for fast joins, Amazon Redshift is often used to build OLAP systems.
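The ELT pattern above can be sketched with the Redshift Data API: bulk-load raw files from S3 with COPY, then transform inside the warehouse with SQL. The cluster, schema, table, bucket, and IAM role names below are placeholders.

```python
# Minimal ELT sketch, assuming raw CSV files already staged in S3 and an
# IAM role the cluster can assume for COPY. All names are placeholders.
import boto3

rsd = boto3.client("redshift-data")

def run(sql):
    return rsd.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )

# 1. Extract/Load: bulk-load the raw data into a staging table.
run("""
    COPY staging.orders_raw
    FROM 's3://my-analytics-raw/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV IGNOREHEADER 1;
""")

# 2. Transform: reshape the loaded data inside the warehouse.
run("""
    CREATE TABLE analytics.daily_order_totals AS
    SELECT order_date, SUM(order_amount) AS total_amount
    FROM staging.orders_raw
    GROUP BY order_date;
""")
```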


Real-time processing

We talked about streaming data earlier and mentioned Amazon Kinesis Services and Amazon MSK as solutions to capture and store streaming data. You can process this data sequentially and incrementally on a record-by-record basis, or over sliding time windows. Use the processed data for a wide variety of analytics, including correlations, aggregations, filtering, and sampling. This type of processing is called real-time processing.

Real-time processing requires a highly concurrent and scalable processing layer.

To process streaming data in real time, use AWS Lambda. Lambda can process the data directly from AWS IoT or Amazon Kinesis Data Streams. Lambda enables you to run code without provisioning or managing servers.
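A minimal Lambda handler for a Kinesis event source might look like the following; the processing step is a placeholder, and Kinesis delivers record payloads base64-encoded.

```python
# Minimal sketch of a Lambda handler invoked by a Kinesis event source.
# Kinesis record payloads arrive base64-encoded in the event document.
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # Placeholder processing step: in practice, filter, enrich, or
        # forward the message to a downstream store.
        print(message)
    return {"processed": len(event["Records"])}
```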

The Kinesis Client Library (KCL) is another way to process data from Amazon Kinesis Data Streams. KCL gives you more flexibility than Lambda to batch your incoming data for further processing. You can also use KCL to apply extensive transformations and customizations in your processing logic.


Amazon Kinesis Data Firehose is the easiest way to load streaming data into AWS. It can capture streaming data and automatically load it into Amazon Redshift, enabling near-real-time analytics with the existing BI tools and dashboards you're already using today. Define batching rules with Kinesis Data Firehose, and it takes care of reliably batching the data and delivering it to Amazon Redshift.
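A producer-side sketch of sending records to a delivery stream follows; the delivery stream name is a placeholder, and the stream is assumed to be configured (outside this code) with Amazon Redshift as its destination.

```python
# Minimal sketch: send a record to a Kinesis Data Firehose delivery stream
# that batches and delivers into Amazon Redshift. The delivery stream name
# is a placeholder.
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="orders-to-redshift",     # hypothetical delivery stream
    Record={
        # Newline-delimited so downstream loading can split records.
        "Data": (json.dumps({"order_id": 1001, "amount": 42.5}) + "\n").encode("utf-8")
    },
)
```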


Amazon MSK is an easy way to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.
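For example, a producer written against the open-source kafka-python client (one of several Kafka-compatible clients you could use) publishes to an MSK topic like this; the broker address and topic name are placeholders.

```python
# Minimal sketch: publish events to an Amazon MSK cluster with the
# open-source kafka-python client. Broker addresses and the topic name
# are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.mymskcluster.example.com:9092"],  # hypothetical brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("page-views", {"user_id": "u-123", "page": "/checkout"})
producer.flush()
```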


AWS Glue streaming jobs enable you to perform complex ETL on streaming data. Streaming ETL jobs in AWS Glue can consume data from streaming sources like Amazon Kinesis Data Streams and Amazon MSK, clean and transform those data streams in-flight, and continuously load the results into S3 data lakes, data warehouses, or other data stores. As you process streaming data in an AWS Glue job, you have access to the full capabilities of Spark Structured Streaming to implement data transformations, such as aggregating, partitioning, and formatting, as well as joining with other data sets to enrich or cleanse the data for easier analysis.
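The following sketch shows the underlying Spark Structured Streaming pattern such a job builds on: read from a Kafka (MSK) topic, reshape the stream, and write Parquet files to S3. Broker, topic, and path names are placeholders, the Kafka source requires the Spark Kafka connector on the classpath, and a real Glue streaming job would add GlueContext boilerplate around this core.

```python
# Minimal Spark Structured Streaming sketch: read a Kafka (MSK) topic,
# parse JSON events, and continuously write Parquet files to S3.
# All names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.mymskcluster.example.com:9092")
    .option("subscribe", "page-views")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3://my-analytics-lake/page-views/")
    .option("checkpointLocation", "s3://my-analytics-lake/checkpoints/page-views/")
    .start()
)
query.awaitTermination()
```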


Data storage

You can store your data in a lake house, data warehouse, or data mart.

  • Lake house — A lake house is an architectural pattern that combines the best elements of data warehouses and data lakes. Lake houses enable you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights that are not possible otherwise. With a lake house architecture, you can store data in open file formats in your data lake and query it in place while joining with data warehouse data. This enables you to make this data easily available to other analytics and machine learning tools, rather than locking it in a new silo.
  • Data warehouse — Using data warehouses, you can run fast analytics on large volumes of data and unearth patterns hidden in your data by leveraging BI tools. Data scientists query a data warehouse to perform offline analytics and spot trends. Users across the enterprise consume the data using SQL queries, periodic reports, and dashboards as needed to make critical business decisions.
  • Data mart — A data mart is a simple form of data warehouse focused on a specific functional area or subject matter. For example, you can have specific data marts for each division in your enterprise, or segment data marts based on regions. You can build data marts from a large data warehouse, operational stores, or a hybrid of the two. Data marts are simple to design, build, and administer. However, because data marts are focused on specific functional areas, querying across those areas can become complex because the data is distributed across multiple marts.

You can use Amazon Redshift to build lake houses, data marts, and data warehouses. Redshift enables you to easily query data in your data lake and write data back to your data lake in open formats. You can use familiar SQL statements to combine and process data across all your data stores and execute queries on live data in your operational databases without requiring any data loading and ETL pipelines.
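For instance, writing a transformed result set back to the data lake in an open format can be sketched with UNLOAD through the Redshift Data API; the cluster, table, bucket, and IAM role names are placeholders.

```python
# Minimal sketch: write a Redshift result set back to the S3 data lake in
# an open format (Parquet) with UNLOAD, via the Redshift Data API.
# Cluster, role, and path names are placeholders.
import boto3

rsd = boto3.client("redshift-data")

rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        UNLOAD ('SELECT order_date, total_amount FROM analytics.daily_order_totals')
        TO 's3://my-analytics-lake/daily_order_totals/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET;
    """,
)
```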


Analysis and visualization

After processing the data and making it available for further analysis, you need the right tools to analyze and visualize the processed data.

In many cases, you can perform data analysis using the same tools you use for processing data. You can use SQL client tools such as SQL Workbench/J to analyze your data in Amazon Redshift with ANSI SQL. Amazon Redshift also works well with popular third-party BI solutions available on the market, such as Tableau and MicroStrategy.

Amazon QuickSight is a fast, cloud-powered BI service that enables you to create visualizations, perform analysis as needed, and quickly get business insights from your data. Amazon QuickSight offers native integration with AWS data sources such as Amazon Redshift, Amazon S3, and Amazon RDS. Amazon QuickSight can automatically detect Amazon Redshift data sources and query them using either a direct query or SPICE mode. SPICE is the in-memory optimized calculation engine for Amazon QuickSight, designed specifically for fast, as-needed data visualization. You can improve the performance of database datasets by importing the data into SPICE instead of using a direct query to the database.
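As a sketch, a scheduled job could refresh a dataset's SPICE copy through the QuickSight API; the account ID, dataset ID, and ingestion ID are placeholders.

```python
# Minimal sketch: trigger a SPICE refresh (ingestion) for a QuickSight
# dataset so dashboards read from the in-memory copy rather than querying
# the database directly. IDs are placeholders.
import boto3

qs = boto3.client("quicksight")

qs.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="sales-dataset-id",              # hypothetical dataset ID
    IngestionId="nightly-refresh-2024-01-01",
)
```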


If you are using Amazon S3 as your primary storage, you can use Amazon Athena/QuickSight integration to perform analysis and visualization. Amazon Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. You can run SQL queries using Athena on data stored in S3 and build business dashboards within QuickSight.
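A minimal sketch of submitting such a query with the Athena API follows; the database, table, and results location are placeholders.

```python
# Minimal sketch: run a standard SQL query with Amazon Athena over data in
# S3; QuickSight can then visualize the same Athena table. Names and the
# output location are placeholders.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS views FROM page_views GROUP BY page",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```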

For another visualization approach, Apache Zeppelin is an open-source BI solution that you can run on Amazon EMR to visualize data in Amazon S3 using Spark SQL. You can also use Apache Zeppelin to visualize data in Amazon Redshift.


Analytics pipeline with AWS services

AWS offers a broad set of services to implement an end-to-end analytics platform.


Data warehouse technology options

In this section, we discuss options available for building a data warehouse: row-oriented databases, column-oriented databases, and massively parallel processing architectures.


Row-oriented databases

Row-oriented databases typically store whole rows in a physical block. High performance for read operations is achieved through secondary indexes. Databases such as Oracle Database Server, Microsoft SQL Server, MySQL, and PostgreSQL are row-oriented database systems. These systems have been traditionally used for data warehousing, but they are better suited for transactional processing (OLTP) than for analytics.


To optimize performance of a row-based system used as a data warehouse, developers use a number of techniques, including:

  • Building materialized views
  • Creating pre-aggregated rollup tables
  • Building indexes on every possible predicate combination
  • Implementing data partitioning to leverage partition pruning by the query optimizer
  • Performing index-based joins
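As a small illustration of two of these techniques (a pre-aggregated rollup table and an index on a common predicate combination), the following uses SQLite purely for demonstration; a row-oriented warehouse would express the same ideas in its own SQL dialect.

```python
# Minimal sketch of a pre-aggregated rollup table and an index on a common
# predicate combination, using SQLite purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_date TEXT, region TEXT, amount REAL);

    -- Pre-aggregated rollup table: one row per day and region.
    CREATE TABLE orders_daily_rollup AS
    SELECT order_date, region, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM orders
    GROUP BY order_date, region;

    -- Index on a frequently filtered predicate combination.
    CREATE INDEX idx_orders_date_region ON orders (order_date, region);
""")
```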


Column-oriented databases

Column-oriented databases organize each column in its own set of physical blocks instead of packing the whole rows into a block. This functionality allows them to be more input/output (I/O) efficient for read-only queries, because they have to read only those columns accessed by a query from disk (or from memory). This approach makes column-oriented databases a better choice than row-oriented databases for data warehousing.


Row-oriented vs. column-oriented databases

After faster I/O, the next biggest benefit of using a column-oriented database is improved compression. Because every column is packed into its own set of blocks, every physical block contains the same data type. When all the data is the same data type, the database can use extremely efficient compression algorithms. As a result, you need less storage compared to a row-oriented database. This approach also results in significantly less I/O, because the same data is stored in fewer blocks.
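A small, self-contained experiment illustrates the effect: compressing the same synthetic records laid out row-wise versus column-wise typically yields a noticeably smaller result for the columnar layout.

```python
# Minimal sketch of why columnar layout compresses better: packing one
# column's values together (all the same type, often repetitive) yields a
# smaller compressed size than compressing interleaved rows.
import json
import zlib

rows = [{"region": "us-east-1", "status": "OK", "amount": i % 100} for i in range(10_000)]

# Row-oriented layout: whole records interleaved.
row_blob = json.dumps(rows).encode("utf-8")

# Column-oriented layout: each column stored contiguously.
columns = {key: [row[key] for row in rows] for key in ("region", "status", "amount")}
col_blob = b"".join(json.dumps(values).encode("utf-8") for values in columns.values())

print("row-oriented compressed:   ", len(zlib.compress(row_blob)))
print("column-oriented compressed:", len(zlib.compress(col_blob)))
```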


Massively Parallel Processing (MPP) architectures

An MPP architecture enables you to use all the resources available in the cluster for processing data, which dramatically increases the performance of petabyte-scale data warehouses. MPP data warehouses allow you to improve performance by simply adding more nodes to the cluster. Amazon Redshift, Druid, Vertica, Greenplum, and Teradata Aster are some of the data warehouses built on an MPP architecture. Open-source frameworks such as Hadoop and Spark also support MPP.
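For example, scaling out an Amazon Redshift cluster is a single API call; the cluster identifier and node count below are placeholders, and Redshift redistributes data across the new node set as part of the resize.

```python
# Minimal sketch: scale an MPP warehouse by adding nodes to an Amazon
# Redshift cluster. The cluster identifier and target node count are
# placeholders.
import boto3

redshift = boto3.client("redshift")

redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",
    NumberOfNodes=6,            # scale out from the current node count
)
```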