Organize EMR workloads and optimize running cost with AWS Step Functions
By Rohan Mehta
Amazon EMR (Elastic MapReduce) is a managed cluster platform that simplifies running frameworks like Apache Spark on AWS to process and analyze big data. It also lets you move large amounts of data into and out of other AWS data stores and databases. However, a major challenge with Amazon EMR is that, by default, a cluster runs the steps submitted to it one at a time, so multiple Spark jobs cannot run simultaneously.
This blog outlines how you can address this challenge by using AWS Step Functions to organize EMR workloads. This helps achieve the following:
- Run multiple jobs in parallel
- Create Spark jobs for data processing
- Build analysis workflows with minimal code
- Optimize cluster utilization for transient EMR clusters
- Track each step of the workflow
AWS Step Functions lets you coordinate and orchestrate a serverless application flow whose steps can run on independent Lambda functions, EC2 (Elastic Compute Cloud), EMR (Elastic MapReduce), or on-premises. It introduces several abstractions for developers, such as:
1. A JSON standard (Amazon States Language) for defining application workflows as state machines encoded in JSON documents
2. A UI for visual editing/debugging of state machine executions
3. An API and UI for querying the execution history of individual state machine executions
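To make the first abstraction concrete, here is a minimal sketch of an Amazon States Language document, built as a Python dictionary and serialized to JSON. The state names and the example payload are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical two-state workflow: a Pass state injects a fixed result,
# then a Succeed state ends the execution.
definition = {
    "Comment": "Minimal illustrative state machine",
    "StartAt": "PrepareInput",
    "States": {
        "PrepareInput": {
            "Type": "Pass",
            "Result": {"model": "example-model"},  # assumed payload
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

# The JSON text is what you would paste into the Step Functions console.
asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

Every state machine needs a StartAt field naming one of the entries under States, and every state must either name a Next state or be terminal.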
AWS Step Functions development model
Developing workflows with Step Functions is similar to building a standard serverless application: you deploy Lambda functions for each application requirement. In Step Functions, however, you also write JSON files that tie the Lambdas together into a cohesive application flow. You then use the console UI to execute and debug your application workflow. The diagram below details the three-step process:

Amazon EMR integrates with Step Functions to provide workflow orchestration capabilities, including parallel executions and steps that depend on the result of a previous step. The integration also helps handle failures and exceptions while running data processing jobs.
Step Functions state machines can create, terminate, or modify EMR clusters and add or cancel EMR steps. State machines are composed from state types including Task, Choice, Parallel, Wait, Fail, Succeed, and Pass, which help developers build complex applications.
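As a rough sketch of how Task states can drive the EMR actions described above, the following builds a state machine that creates a cluster, adds one Spark step, and terminates the cluster. The resource ARNs are the Step Functions service integrations for EMR; the cluster name, release label, roles, instance types, and S3 path are all assumptions made for illustration, not values from the implementation described in this post.

```python
import json

# Path where the CreateCluster result is stored in the state data.
CLUSTER_ID_PATH = "$.CreateClusterResult.ClusterId"

definition = {
    "Comment": "Transient EMR cluster: create, run one Spark step, terminate",
    "StartAt": "CreateCluster",
    "States": {
        "CreateCluster": {
            "Type": "Task",
            # ".sync" makes the workflow wait until the cluster is ready.
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {
                "Name": "example-transient-cluster",   # hypothetical
                "ReleaseLabel": "emr-6.15.0",          # assumed release
                "Applications": [{"Name": "Spark"}],
                "ServiceRole": "EMR_DefaultRole",      # assumed role
                "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed role
                "Instances": {
                    "KeepJobFlowAliveWhenNoSteps": True,
                    "InstanceGroups": [
                        {"InstanceRole": "MASTER",
                         "InstanceType": "m5.xlarge", "InstanceCount": 1},
                        {"InstanceRole": "CORE",
                         "InstanceType": "m5.xlarge", "InstanceCount": 2},
                    ],
                },
            },
            "ResultPath": "$.CreateClusterResult",
            "Next": "RunSparkJob",
        },
        "RunSparkJob": {
            "Type": "Task",
            # ".sync" waits for the EMR step to finish before moving on.
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": CLUSTER_ID_PATH,
                "Step": {
                    "Name": "example-spark-job",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "--deploy-mode", "cluster",
                                 "s3://example-bucket/jobs/checks.py"],
                    },
                },
            },
            # Keep the step result separate so the cluster ID is preserved.
            "ResultPath": "$.StepResult",
            "Next": "TerminateCluster",
        },
        "TerminateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
            "Parameters": {"ClusterId.$": CLUSTER_ID_PATH},
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```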
The Step Functions workflow is as follows:
1. Create a state machine using the Step Functions console
2. Use Amazon States Language (ASL) to write the workflow
3. The Step Functions console creates the workflow from the ASL definition
You can also change an EMR cluster's termination protection from the state machine. This lets developers reuse an existing EMR cluster for their workflow or create an on-demand cluster during workflow execution.
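A small sketch of a Task state that toggles termination protection is shown below. The helper name and the assumption that the cluster ID arrives in the execution input under ClusterId are illustrative only.

```python
import json

def termination_protection_state(cluster_id_path, protected, next_state):
    """Build a Task state that sets EMR termination protection.

    cluster_id_path is a JSONPath into the state input (hypothetical layout).
    """
    return {
        "Type": "Task",
        "Resource": ("arn:aws:states:::elasticmapreduce:"
                     "setClusterTerminationProtection"),
        "Parameters": {
            "ClusterId.$": cluster_id_path,
            "TerminationProtected": protected,
        },
        # A null ResultPath discards the task result and passes the
        # input through unchanged, so later states still see ClusterId.
        "ResultPath": None,
        "Next": next_state,
    }

# Example: disable protection on a reused cluster before terminating it.
state = termination_protection_state("$.ClusterId", False, "TerminateCluster")
print(json.dumps(state, indent=2))
```

Disabling protection just before a terminate state is one way to let a workflow shut down a cluster that was protected while jobs were running.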
Implementation use case
Impetus Technologies has extensive experience in helping enterprises design, architect, build, migrate, and manage their AWS workloads and applications. Here’s an overview of how we implemented AWS Step Functions on the enterprise execution model monitoring platform for a US-based bank.
The platform automatically runs static checks (such as “not null”, “range”, and “valid values”) and dynamic checks (such as “mean” and “threshold”) on the model input/output data to manage execution-time risks and send notifications in real time. The platform runs Spark jobs on an EMR cluster to perform these checks. Cron-based schedules in Amazon CloudWatch trigger the jobs at fixed intervals (every hour for static jobs, every 6 hours for dynamic jobs) for multiple models.
However, since the EMR cluster ran only one Spark job at a time, pending jobs waited in a queue and the backlog grew. As a result, the EMR cluster ran for long hours, increasing running costs.
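For reference, the two intervals mentioned above could be written as CloudWatch schedule expressions along these lines. The helper function is hypothetical, and the exact expressions used on the platform are not given in this post.

```python
def cron_every_hours(hours):
    """Return a CloudWatch cron expression firing every `hours` hours.

    Field order: minutes hours day-of-month month day-of-week year.
    Only divisors of 24 are accepted so the interval stays uniform
    across midnight. Schedule times are evaluated in UTC.
    """
    if hours < 1 or hours > 24 or 24 % hours != 0:
        raise ValueError("hours must be a divisor of 24")
    # "*" fires every hour; "0/6" fires at hours 0, 6, 12, and 18.
    hour_field = "*" if hours == 1 else f"0/{hours}"
    return f"cron(0 {hour_field} * * ? *)"

# Assumed mapping of job type to schedule, matching the intervals above.
schedules = {
    "static": cron_every_hours(1),
    "dynamic": cron_every_hours(6),
}
print(schedules)
```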
To control costs and avoid long job queues, we implemented AWS Step Functions and created parallel workflows for different models. The workflows orchestrated our PySpark jobs on transient EMR clusters, which ran the Spark jobs in parallel at scheduled hours and then shut down, thereby minimizing costs.
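One way to express per-model parallel workflows is a Parallel state with one branch per model, sketched below. The model names, script location, and input layout (a ClusterId field in the state input) are assumptions; this is not the actual definition used on the platform.

```python
import json

def spark_step(model, script_uri):
    """Task state that submits one PySpark job for a single model."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId.$": "$.ClusterId",  # assumed input field
            "Step": {
                "Name": f"checks-{model}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_uri, "--model", model],
                },
            },
        },
        "End": True,
    }

def parallel_checks(models, script_uri):
    """Parallel state with one single-task branch per model."""
    branches = [
        {"StartAt": f"Run-{m}",
         "States": {f"Run-{m}": spark_step(m, script_uri)}}
        for m in models
    ]
    return {"Type": "Parallel", "Branches": branches, "End": True}

# Hypothetical model names and script path.
state = parallel_checks(["model-a", "model-b", "model-c"],
                        "s3://example-bucket/jobs/checks.py")
print(json.dumps(state, indent=2))
```

Each branch receives a copy of the Parallel state's input, so every branch can read the same cluster ID. Whether the submitted steps actually overlap on a shared cluster depends on the cluster's step concurrency setting; alternatively, each branch could create its own transient cluster.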
The diagram below explains the implementation model:

The integration connecting Step Functions with Amazon EMR helped drive the following benefits:
- Created data processing and analysis workflows with minimal code
- Saved implementation time
- Optimized cluster utilization
- Simplified orchestration of workflow capabilities, including parallel executions and dependencies on the results of previous steps
- Handled failures and exceptions when running data processing jobs
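Failure handling of the kind listed above is typically expressed with the Retry and Catch fields of a Task state. The sketch below wraps an existing task definition with both; the retry counts, intervals, and state names are illustrative assumptions.

```python
import json

def with_error_handling(task_state, fallback_state):
    """Return a copy of task_state with Retry and Catch rules added."""
    guarded = dict(task_state)  # shallow copy; the original is not modified
    guarded["Retry"] = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 30,   # assumed wait before the first retry
        "MaxAttempts": 2,        # assumed retry budget
        "BackoffRate": 2.0,      # wait doubles on each further attempt
    }]
    guarded["Catch"] = [{
        "ErrorEquals": ["States.ALL"],
        # Store error details next to the input instead of replacing it.
        "ResultPath": "$.error",
        "Next": fallback_state,
    }]
    return guarded

# Hypothetical task that would otherwise fail the whole execution.
task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "Parameters": {"ClusterId.$": "$.ClusterId"},
    "Next": "TerminateCluster",
}
guarded_task = with_error_handling(task, "TerminateCluster")
print(json.dumps(guarded_task, indent=2))
```

Routing the Catch rule to a terminate state means a failed job still releases the transient cluster rather than leaving it running.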
This is just one instance of how AWS Step Functions can organize EMR workloads to achieve the benefits mentioned above. Impetus Technologies has enabled massive-scale data analytics for several Fortune 1000 enterprises. Our expertise can help you design a solution to process vast amounts of data quickly and cost-effectively.