Organizations often use cloud-based applications to analyze large amounts of data across a variety of scenarios, including system and application logs, business metrics, external data sources, public data sets, and input data for machine learning (ML) models.
AWS, the largest public cloud provider, offers a very wide range of services focused on big data and data analytics. These services may have overlapping capabilities, making it difficult to decide which one to choose.
Take a closer look at Amazon Redshift, Amazon Athena, and Amazon EMR to find which one best suits your data analytics needs.
While all three services are suitable for a wide range of data analysis tasks, before selecting one it is important to evaluate the required integrations with relevant systems and data sources, consider the amount of data you intend to analyze, perform load testing, and estimate costs for the specific use case of your application.
Amazon Redshift
Amazon Redshift is a managed data warehouse that stores data centrally and runs analysis queries against it. The work is done on a Redshift cluster, which consists of one or more compute nodes that also store data. Although Redshift can analyze data stored in Amazon S3 through Amazon Redshift Spectrum, its primary focus is on data stored in the cluster itself. It also offers a serverless configuration that can access data in Redshift managed storage or in external sources.
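As a rough illustration of the two deployment modes, the sketch below builds the arguments for the Redshift Data API `execute_statement` call in boto3, where a provisioned cluster is addressed with `ClusterIdentifier` and a serverless workgroup with `WorkgroupName`. The helper `build_statement_args` is an illustrative name rather than part of any AWS SDK, and the cluster, workgroup, database, and table names are hypothetical placeholders.

```python
from typing import Optional


def build_statement_args(sql: str, database: str, *,
                         cluster_id: Optional[str] = None,
                         workgroup: Optional[str] = None,
                         secret_arn: Optional[str] = None) -> dict:
    """Build keyword arguments for the Redshift Data API execute_statement call.

    Exactly one of cluster_id (provisioned) or workgroup (serverless) is required.
    """
    if (cluster_id is None) == (workgroup is None):
        raise ValueError("Pass exactly one of cluster_id or workgroup")
    args = {"Sql": sql, "Database": database}
    if cluster_id is not None:
        args["ClusterIdentifier"] = cluster_id  # provisioned cluster
    else:
        args["WorkgroupName"] = workgroup  # Redshift Serverless workgroup
    if secret_arn is not None:
        args["SecretArn"] = secret_arn  # optional Secrets Manager credentials
    return args


# Hypothetical names, for illustration only.
provisioned = build_statement_args(
    "SELECT COUNT(*) FROM sales", "analytics", cluster_id="example-cluster")
serverless = build_statement_args(
    "SELECT COUNT(*) FROM sales", "analytics", workgroup="example-workgroup")
```

With boto3 installed and credentials configured, something like `boto3.client("redshift-data").execute_statement(**serverless)` would submit the query asynchronously. The authentication settings that call needs (for example `SecretArn` or `DbUser`) depend on your setup.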
Redshift brings together data from various sources and stores it in a structured form defined by databases and tables. It is a strong choice when you need to run complex queries quickly on very large collections of structured and semi-structured data. Redshift can automatically ingest data from:
Amazon Relational Database Service (RDS).
Amazon Aurora MySQL.
Amazon Kinesis.
Amazon Managed Streaming for Apache Kafka.
Other AWS services, such as EMR, Glue, and SageMaker, can also access the stored data. In addition, Redshift can run ML training and prediction processes directly on the data it holds.
Because data is stored directly on the cluster, Redshift's primary model, provisioned clusters, makes it difficult to scale down based on usage patterns. Provisioned Redshift clusters are often always-on, which typically means higher costs. IT teams can mitigate this issue by using reserved nodes, which are charged at a discounted hourly rate for a one- or three-year term. Redshift Serverless is another cost-saving option, allocating adjustable compute power based on application requirements.
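The trade-off between on-demand and reserved pricing can be sketched with simple arithmetic. The example below assumes reserved capacity is billed for every hour of the month while on-demand capacity is billed only for the hours it runs. The hourly rates are hypothetical round numbers chosen for illustration, not published AWS prices.

```python
HOURS_PER_MONTH = 730  # average hours in a month, a common billing approximation


def monthly_on_demand_cost(hourly_rate: float, nodes: int, hours_running: float) -> float:
    """Cost when nodes are billed only for the hours they actually run."""
    return hourly_rate * nodes * hours_running


def monthly_reserved_cost(effective_hourly_rate: float, nodes: int) -> float:
    """Cost when the discounted rate is billed for every hour of the month."""
    return effective_hourly_rate * nodes * HOURS_PER_MONTH


def break_even_hours(on_demand_rate: float, reserved_rate: float) -> float:
    """Monthly running hours above which the reserved rate is cheaper."""
    return HOURS_PER_MONTH * reserved_rate / on_demand_rate


# Hypothetical rates in dollars per node-hour (not real AWS prices).
ON_DEMAND_RATE = 1.00
RESERVED_RATE = 0.50

threshold = break_even_hours(ON_DEMAND_RATE, RESERVED_RATE)
always_on = monthly_on_demand_cost(ON_DEMAND_RATE, nodes=2, hours_running=HOURS_PER_MONTH)
reserved = monthly_reserved_cost(RESERVED_RATE, nodes=2)
print(f"Reserved pricing wins above {threshold} running hours per month")
print(f"Always-on, 2 nodes: on-demand ${always_on:.2f} vs reserved ${reserved:.2f}")
```

Under these assumed rates, a cluster that truly runs around the clock costs half as much on the reserved rate, while a cluster needed for fewer hours than the break-even threshold would be cheaper on demand.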
[Figure: Compare relevant aspects of Redshift, Athena, and EMR.]
Amazon Athena
Built on the open source Trino, Presto, and Spark engines, Amazon Athena is a serverless service for data analytics on AWS. It is widely used to analyze log data exported to and stored in S3 by services such as:
Application Load Balancer.
Amazon CloudFront.
AWS CloudTrail.
Amazon Data Firehose.
It also provides access to data defined in the AWS Glue Data Catalog and supports sources such as Amazon DynamoDB, CloudWatch, and Redshift, as well as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) drivers. In addition, it integrates with ML inference endpoints, so queries defined in Athena can call available ML models.
While Athena is the easiest way to visualize data stored in S3, services like EMR and Redshift give developers control over the underlying infrastructure, potentially offering better performance at the expense of higher costs. Athena also integrates with Amazon QuickSight for automatic visualization of data and query results.
Data analysts use Athena to run queries using SQL syntax. It is cost-effective in most cases because you do not need to configure the underlying compute infrastructure and, under the default configuration, you pay only for the data scanned. However, it can become expensive for use cases with consistently high query volumes.
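To make the pay-per-scan model concrete, here is a minimal cost sketch. The default price of 5 dollars per terabyte scanned, the 10 MB minimum charge per query, and the use of binary terabytes are all assumptions for illustration. Check current AWS pricing for your Region before relying on any of these figures.

```python
TB = 1024 ** 4  # bytes per terabyte (assumption: binary terabytes)
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumed 10 MB minimum charge per query


def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimated cost of one query under per-data-scanned pricing."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * price_per_tb


def monthly_cost(queries_per_day: int, avg_bytes_scanned: int,
                 price_per_tb: float = 5.0, days: int = 30) -> float:
    """Estimated monthly cost for a steady stream of similar queries."""
    return queries_per_day * days * athena_query_cost(avg_bytes_scanned, price_per_tb)


# Hypothetical workloads: occasional ad hoc analysis vs. constant heavy querying.
GB = 1024 ** 3
ad_hoc = monthly_cost(queries_per_day=20, avg_bytes_scanned=50 * GB)
heavy = monthly_cost(queries_per_day=5000, avg_bytes_scanned=50 * GB)
print(f"Ad hoc workload: ${ad_hoc:,.2f} per month")
print(f"Heavy workload:  ${heavy:,.2f} per month")
```

Because cost scales with bytes scanned multiplied by the number of queries, reducing the data each query reads (for example through partitioning or columnar formats) lowers the bill in the same proportion.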
Athena is well suited for infrequent or ad-hoc data analysis needs because it does not require users to spin up any infrastructure and the service is always available to query data. Developers may also want to explore the Athena Provisioned Capacity feature to allocate a minimum amount of compute capacity, which is useful for predictable workloads.
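An ad hoc query needs little more than a SQL string and an S3 location for results. The sketch below wraps the boto3 Athena calls `start_query_execution` and `get_query_execution` in a small polling helper. The function name `run_athena_query`, its defaults, and any database, table, or bucket names you pass in are illustrative assumptions. The client is passed in as an argument, so the helper itself does not import boto3.

```python
import time


def run_athena_query(client, sql: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60):
    """Submit a query, poll until it finishes, and return (state, bytes_scanned).

    `client` is expected to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            # Bytes scanned is what per-query pricing is based on.
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)

    raise TimeoutError(f"Query {query_id} did not finish after {max_polls} polls")
```

With boto3 installed and credentials configured, a call might look like `run_athena_query(boto3.client("athena"), "SELECT status, COUNT(*) FROM alb_logs GROUP BY status", "logs_db", "s3://example-bucket/athena-results/")`, where the table, database, and bucket are placeholders.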
Amazon EMR
Amazon EMR is a big data service that provides managed deployments of popular data analytics platforms such as Presto, Apache Spark, Apache Hadoop, Apache Hive, Apache Hudi, and Apache HBase. EMR automates the launch of compute and storage nodes running on Amazon EC2 instances, AWS Fargate, or on-premises infrastructure managed by AWS Outposts, and it also offers a serverless option.
Data can be stored within EC2 instances using the Hadoop Distributed File System, but the service also supports querying data stored in sources outside the cluster, such as S3, DynamoDB, relational databases managed by RDS, etc. Data managed by EMR can also be accessed by SageMaker for ML training tasks.
EMR allows you to size your cluster based on your usage requirements to optimize costs. To help reduce costs further, it supports Reserved Instances and Savings Plans for EC2-based clusters, as well as Savings Plans for Fargate.
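One way to express cluster sizing limits in code is EMR managed scaling, which is assumed here as an example mechanism since the text does not name a specific feature. The sketch builds the `ManagedScalingPolicy` structure accepted by the boto3 EMR call `put_managed_scaling_policy`. The helper name and the capacity numbers are hypothetical.

```python
from typing import Optional

VALID_UNIT_TYPES = ("InstanceFleetUnits", "Instances", "VCPU")


def build_managed_scaling_policy(min_units: int, max_units: int,
                                 max_on_demand_units: Optional[int] = None,
                                 unit_type: str = "Instances") -> dict:
    """Build a ManagedScalingPolicy dict bounding how far a cluster may scale."""
    if unit_type not in VALID_UNIT_TYPES:
        raise ValueError(f"unit_type must be one of {VALID_UNIT_TYPES}")
    if not 1 <= min_units <= max_units:
        raise ValueError("Require 1 <= min_units <= max_units")

    limits = {
        "UnitType": unit_type,
        "MinimumCapacityUnits": min_units,  # floor kept running at all times
        "MaximumCapacityUnits": max_units,  # ceiling that caps spend
    }
    if max_on_demand_units is not None:
        if max_on_demand_units > max_units:
            raise ValueError("max_on_demand_units cannot exceed max_units")
        # Capacity above this limit would have to come from Spot Instances.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    return {"ComputeLimits": limits}


# Hypothetical limits: keep 2 instances running, allow bursts up to 10.
policy = build_managed_scaling_policy(min_units=2, max_units=10, max_on_demand_units=4)
```

With boto3 installed, the policy could be applied with `boto3.client("emr").put_managed_scaling_policy(ClusterId="j-EXAMPLE", ManagedScalingPolicy=policy)`, where the cluster ID is a placeholder.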
EMR is well suited for predictable data analytics tasks, especially on clusters that need to be available for extended periods of time. This includes data loads where control over the underlying infrastructure (i.e., EC2 instances) enables performance optimizations that justify the extra effort.
Ernesto Marquez is the Owner and Project Director at Concurrency Labs, helping startups launch and grow their applications on AWS, with a particular focus on building serverless architectures, automating everything, and helping customers reduce their AWS costs.