3

The spark UI has an SQL tab. It can show the query detail as a DAG

https://www.cloudera.com/documentation/enterprise/5-9-x/topics/operation_spark_applications.html

After the application finishes, the DAG also annotates its nodes with statistic information. For example,

number of output rows: 155,418,058

peak memory total (min, med, max): 
24.1 GB (704.0 MB, 704.0 MB, 704.0 MB)

aggregate time total (min, med, max): 
15.6 m (20.8 s, 25.5 s, 42.1 s)

Exchange data size total (min, med, max): 
1350.1 MB (2.2 MB, 2.3 MB, 2.3 MB)

Does Spark have any API to fetch the metics? Spark has https://spark.apache.org/docs/latest/monitoring.html#executor-task-metrics accessed by RESTful APIs. And the stage tab on Spark UI also shows "Summary Metrics" of each task. However

1) I am not sure how to relate task ID to RDD or nodes on a query DAG

2) the Peak Execution Memory metric is always 0, while as we can see the SQL tab can show

peak memory total (min, med, max): 
24.1 GB (704.0 MB, 704.0 MB, 704.0 MB)

The other question is how to read the metrics on DAG nodes. For example,

peak memory total (min, med, max): 
24.1 GB (704.0 MB, 704.0 MB, 704.0 MB)

Is min, med, max for a node? Its value is much smaller than the total 24.1G ...

Joe C
  • 2,757
  • 2
  • 26
  • 46

0 Answers0