Building a customized Spark metrics collection tool for a global enterprise technology provider
A global enterprise technology provider for the banking, retail, and hospitality industries wanted to modernize its legacy Teradata workloads to Spark (Scala) to enhance scalability and power faster analytics. For some of the strategic Teradata applications being migrated, the customer wanted to collect the following Spark metrics for each run:
· Number of scripts running
· Number of queries running for each script
· Start time
· End time
· Execution time
· Number of stages created
· Number of tasks created
· Executor CPU time
· Total records written
· Total bytes written
· Total records read
· Total bytes read
· Memory bytes spilled
· Peak execution memory
These metrics were required in the form of aggregated reports for the specified applications, which would enable the customer to optimize resource utilization.
The Impetus Approach
Building a Spark metrics collector
As part of the end-to-end migration, we used Apache Oozie to orchestrate Spark SQL scripts in the target environment. To collect the Spark metrics, we needed the ‘application_id’ of the last successful run of each Spark application scheduled through Oozie. We therefore designed a customized Spark-based tool (written in Python) that captured the necessary information from the applications’ Spark job logs by following the steps given below:
· Used the Oozie command to get the ‘oozie_id’ of the last successful run of each application, based on its workflow name
· Used the Oozie command to get job information for that ‘oozie_id’, including the ‘job_id’ of each action in the Oozie workflow
· Used the ‘mapred’ command to fetch the MapReduce job log for each action
· Parsed the log to extract the Spark job ‘application_id’
· Used this ‘application_id’ to fetch the Spark logs from the Spark history server
· Parsed the Spark logs to compute comprehensive statistics
· Prepared the Spark metrics report in .csv format
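The log-parsing step above can be sketched as follows. This is a minimal illustration, not the actual tool: the `oozie` and `mapred` CLI invocations in the comments are real commands, but the workflow name and filters shown are hypothetical placeholders, and only the ‘application_id’ extraction is implemented here.

```python
import re

# The upstream CLI calls would look roughly like this (placeholders,
# not executed here):
#   oozie jobs -filter "name=<workflow_name>;status=SUCCEEDED" -len 1
#   oozie job -info <oozie_id>
#   mapred job -logs <job_id>

# YARN application IDs follow the pattern application_<clusterTs>_<seq>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")

def extract_application_id(log_text):
    """Return the first Spark application_id found in a MapReduce job log."""
    match = APP_ID_PATTERN.search(log_text)
    return match.group(0) if match else None

# Example against a typical log line:
sample = "... Submitted application application_1650000000000_0042 to ResourceManager ..."
print(extract_application_id(sample))  # application_1650000000000_0042
```

With the ‘application_id’ in hand, the Spark history server's logs for that run can be fetched and parsed for the metrics listed earlier.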
This tool can also be leveraged for use cases where the Spark job ‘application_id’ is available directly, even if Oozie is not the orchestrator.
As the Spark logs were available in JSON format and were extremely large, we implemented the tool in two ways:
· Using single-node Spark: We first executed the Spark metrics collector on Hortonworks Data Platform (HDP) using single-node Spark to process the huge log files.
· Using Pandas: Since single-node Spark was taking too long to prepare the required metrics, we leveraged Pandas (a Python library) to process the same data in a shorter timeframe.
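The Pandas path can be sketched as below. This is a simplified example, assuming the Spark event log is a JSON-lines file with one event object per line; the metric field names follow Spark's event-log schema for `SparkListenerTaskEnd` events, but should be verified against the Spark version in use, and the two inline events stand in for a real log file.

```python
import json
import pandas as pd

# Two illustrative task-end events standing in for a real event log.
event_lines = [
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Metrics": {"Executor CPU Time": 1200,
                                 "Memory Bytes Spilled": 0,
                                 "Input Metrics": {"Bytes Read": 512, "Records Read": 10},
                                 "Output Metrics": {"Bytes Written": 256, "Records Written": 5}}}),
    json.dumps({"Event": "SparkListenerTaskEnd",
                "Task Metrics": {"Executor CPU Time": 800,
                                 "Memory Bytes Spilled": 64,
                                 "Input Metrics": {"Bytes Read": 1024, "Records Read": 20},
                                 "Output Metrics": {"Bytes Written": 128, "Records Written": 2}}}),
]

events = [json.loads(line) for line in event_lines]
task_metrics = [e["Task Metrics"] for e in events
                if e.get("Event") == "SparkListenerTaskEnd"]

# json_normalize flattens the nested metric dicts into dotted column names.
df = pd.json_normalize(task_metrics)

summary = {
    "tasks": len(df),
    "executor_cpu_time": int(df["Executor CPU Time"].sum()),
    "bytes_read": int(df["Input Metrics.Bytes Read"].sum()),
    "records_written": int(df["Output Metrics.Records Written"].sum()),
}
print(summary)
```

Aggregations like these can then be written out per application with `DataFrame.to_csv` to produce the .csv report.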
Next, the tool was wrapped in a shell script. Once the tool was deployed, this shell script became the single entry point from which the Oozie, mapred, and Spark commands and the Python classes were executed sequentially.
In addition to collecting Spark metrics, we used the tool to email detailed metrics reports to the relevant stakeholders. Here’s a glimpse of a sample report:
The global enterprise technology provider could easily run the Spark metrics collection tool, covering 21 Teradata applications in roughly 25 minutes. It provided vital information about the resource consumption of strategic applications deployed on the cluster. This, in turn, helped them analyze specific consumption patterns and optimize cluster utilization in line with their business needs.