
I am trying to create a state diagram of a submitted Spark application, and I am kind of lost on when an application is considered FAILED.

States are from here: https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala
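
For convenience, here is what that enumeration looks like at the linked commit (comments abridged from the source file):

```scala
private[deploy] object DriverState extends Enumeration {

  type DriverState = Value

  // SUBMITTED: submitted but not yet scheduled on a worker
  // RUNNING: has been allocated to a worker to run
  // FINISHED: previously ran and exited cleanly
  // RELAUNCHING: exited non-zero or due to worker failure, but has not yet restarted
  // UNKNOWN: the state of the driver is temporarily not known due to master failure recovery
  // KILLED: a user manually killed this driver
  // FAILED: the driver exited non-zero and was not supervised
  // ERROR: unable to run or restart due to an unrecoverable error (e.g. missing jar file)
  val SUBMITTED, RUNNING, FINISHED, RELAUNCHING, UNKNOWN, KILLED, FAILED, ERROR = Value
}
```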

[image: draft state diagram]

Aravind Yarram

1 Answer


This stage is very important: when it comes to Big Data, Spark is awesome, but let's face it, we haven't solved the failure problem yet!

When a task/job fails, Spark restarts it (recall that the RDD, the main abstraction Spark provides, is a *Resilient* Distributed Dataset; that resilience is not exactly what we are after here, but it gives the intuition).

I use Spark 1.6.2, and my cluster restarts the job/task up to 3 times; only when the final attempt fails is the application marked as FAILED.
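
These retry counts are configurable. A minimal sketch of the relevant settings, assuming a YARN cluster like the one in the screenshots (the property names are real Spark 1.6 settings; the values are illustrative, not defaults):

```scala
import org.apache.spark.SparkConf

// Illustrative retry-related settings; values are examples, not recommendations.
val conf = new SparkConf()
  // How many times any particular task may fail before Spark gives up on
  // the job (default is 4 in Spark 1.6).
  .set("spark.task.maxFailures", "4")
  // On YARN, how many attempts the whole application gets before it is
  // reported as FAILED (capped by the ResourceManager's own limit).
  .set("spark.yarn.maxAppAttempts", "3")
```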

For example, one of my recent jobs had to restart a whole stage:

[screenshot: Spark UI showing a whole stage being retried]

In the cluster/application UI, one can see the attempt IDs; here the application is in its 3rd and final attempt:

[screenshot: cluster UI listing the application's attempt IDs]

If that attempt is marked as FAILED (for whatever reason, e.g. out of memory, bad DNS, GC allocation failure, a failed disk, a node that didn't respond to the 4 heartbeats (it is probably down), etc.), then Spark relaunches the job.
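
The heartbeat window itself is configurable. A minimal sketch of the relevant knobs, assuming a standalone master for the worker timeout (values illustrative):

```scala
import org.apache.spark.SparkConf

// Illustrative timeout/heartbeat settings; values are examples, not advice.
val conf = new SparkConf()
  // The standalone master declares a worker lost after this many seconds
  // without a heartbeat; workers heartbeat every timeout/4 seconds, which is
  // where the "4 heartbeats" above comes from. Set on the master; default 60.
  .set("spark.worker.timeout", "60")
  // How often executors send heartbeats (and task metrics) to the driver.
  .set("spark.executor.heartbeatInterval", "10s")
```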

gsamaras
  • so essentially this means that a failed job can be relaunched a configurable number of times... at which point it can be in the RELAUNCHING state – Aravind Yarram Aug 26 '16 at 18:16
  • @Pangea when Spark notices that, for example, the first attempt failed, it will go on and schedule the re-launch of the job. However, this will not happen instantly! For example, my job has 2000 executors. Once the 1st attempt fails, it loses *all* its executors and all the progress it has made. That means that the 2nd attempt starts from scratch: the job is ACCEPTED, but has to wait to be put in the RUNNING state by the Scheduler (in that phase our job is in the RELAUNCHING state, i.e. the process Spark goes through to prepare the ground for the 2nd attempt to run); see the sketch after these comments – gsamaras Aug 26 '16 at 18:19
  • there seems to be no ACCEPTED state https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala – Aravind Yarram Aug 26 '16 at 18:38
  • @Pangea the job gets ACCEPTED, not the driver. But in any case, you can ignore it if you wish (I mean it might depend on the Scheduler). Check it [here](http://stackoverflow.com/questions/30828879/application-report-for-application-state-accepted-never-ends-for-spark-submi). In general, ACCEPTED means that the job is accepted on the cluster, but is not yet RUNNING. – gsamaras Aug 26 '16 at 18:40
  • Thanks. It would be great if you could help me complete this state diagram for the standalone scheduler. I can add it to a GitBook if you want to contribute – Aravind Yarram Aug 26 '16 at 18:48
  • @Pangea I have a real cluster, a corporate one. Tell me what more info you want and I will provide it; a contribution would be nice – gsamaras Aug 26 '16 at 18:52
  • I will comment here after I make this available on GitBook. Thanks – Aravind Yarram Aug 26 '16 at 18:54
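
Putting the answer and comments together, here is a rough sketch of the standalone DriverState transitions, inferred from the comments in the linked file. It is not an authoritative diagram, and note that ACCEPTED is a YARN application state, not a DriverState:

```
SUBMITTED --> RUNNING --> FINISHED              (exited cleanly)
                 |
                 +--> KILLED                    (user killed the driver)
                 +--> FAILED                    (non-zero exit, not supervised)
                 +--> ERROR                     (unrecoverable error, e.g. missing jar)
                 +--> RELAUNCHING --> RUNNING   (non-zero exit or worker loss, supervised)
                 +--> UNKNOWN                   (master failure recovery)
```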