
I have already seen this question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?

I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
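To be concrete, this is roughly what I mean: a minimal listener sketch that counts started/finished jobs (the class and field names are just mine for illustration):

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Minimal sketch: counts jobs as they start and finish, but it has no way
// of knowing how many jobs the whole app will eventually submit.
class JobCountListener extends SparkListener {
  val started  = new AtomicInteger(0)
  val finished = new AtomicInteger(0)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    started.incrementAndGet()

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    finished.incrementAndGet()
}

// Registered on the driver with something like:
//   val listener = new JobCountListener
//   sc.addSparkListener(listener)
```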

I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find that total anywhere.

@Edit: I know there's a REST endpoint for getting all the jobs in an app but:

  1. I would prefer not to use REST, but to get it in the app itself (Spark is running on AWS EMR/YARN; getting the address is probably doable, but I'd prefer not to)
  2. that REST endpoint seems to return only jobs that are running/finished/failed, so not the total number of jobs (a rough sketch of that call is below).
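For reference, the call I'd rather avoid would look roughly like this; `driver-host` and `app-id` are placeholders I'd first have to resolve on EMR/YARN, which is exactly the painful part:

```scala
import scala.io.Source

// Sketch of the REST call I'd rather not make: /jobs only lists jobs that
// have already been submitted (running/succeeded/failed), so it still
// doesn't tell me how many jobs the app will run in total.
// "driver-host" and "app-id" are placeholders.
val jobsJson = Source
  .fromURL("http://driver-host:4040/api/v1/applications/app-id/jobs")
  .mkString
println(jobsJson)
```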
  • You have it in the UI at: http://master-host:4040 – Avihoo Mamka Mar 03 '16 at 12:18
  • @AvihooMamka I need to get it programmatically somehow, my users don't have access to that UI so I need to show that progress in my app – Mateusz Dymczyk Mar 03 '16 at 12:23
  • Try this maybe: http://stackoverflow.com/questions/27165194/how-to-get-spark-job-status-from-program – Avihoo Mamka Mar 03 '16 at 12:35
  • right, I know about the REST API, but since I'm deploying it on AWS EMR with YARN getting the URL is a pain and I'd prefer to do it in my Spark job and just ping my app from it. Trying to find Spark's web server code to see how they get the list of jobs :-) – Mateusz Dymczyk Mar 03 '16 at 12:36

1 Answer


After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront (each action submits its job independently, so Spark doesn't have a big picture of all the jobs from the start).

This kind of makes sense because of how Spark divides work into:

  • jobs - which are started whenever the code running on the driver node encounters an action (e.g. collect(), take()) and which are supposed to compute a value and return it to the driver
  • stages - which are composed of sequences of tasks between which no data shuffling is required
  • tasks - computations of the same type which can run in parallel on worker nodes

So while we do need to know the stages and tasks of a single job upfront to create its DAG, we don't necessarily need a DAG of jobs; we can just create them "as we go".
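If incremental progress is enough, a rough sketch of that "as we go" idea could poll SparkContext.statusTracker; note that the expectedJobs total and the "my-app" job group below are assumptions the application itself has to supply, since Spark cannot know them upfront:

```scala
import org.apache.spark.{JobExecutionStatus, SparkContext}

// "As we go" progress sketch: the app tags its actions with a job group
// (e.g. sc.setJobGroup("my-app", "progress demo")) and supplies its own
// expected total, because Spark never builds a DAG of jobs upfront.
def reportProgress(sc: SparkContext, expectedJobs: Int): Unit = {
  val tracker   = sc.statusTracker
  val succeeded = tracker
    .getJobIdsForGroup("my-app")
    .flatMap(tracker.getJobInfo)       // Option[SparkJobInfo] per job id
    .count(_.status() == JobExecutionStatus.SUCCEEDED)
  println(f"Completed $succeeded of ~$expectedJobs jobs (${100.0 * succeeded / expectedJobs}%.1f%%)")
}
```

The percentage is only as accurate as the expected total the application provides, but that's about the best it can do, given that Spark only learns about a job when its action is actually called.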
