
I'm trying to understand whether the Spark Driver is a single point of failure when deploying in cluster mode on YARN. So I'd like to get a better grasp of how failover works for the YARN container that hosts the Spark Driver in this context.

I know that the Spark Driver runs in the Spark Application Master inside a YARN container, and that the Spark Application Master requests resources from the YARN Resource Manager when required. But I haven't been able to find a document with enough detail about the failover process in the event that the YARN container holding the Spark Application Master (and Spark Driver) fails.

I'm trying to find detailed resources that would let me answer some questions about the following scenario: the host machine of the YARN container that runs the Spark Application Master / Spark Driver loses network connectivity for 1 hour:

  1. Does the YARN Resource Manager spawn a new YARN Container with another Spark Application Master/Spark Driver?

  2. In that case (spawning a new YARN container), does it start the Spark Driver from scratch, even if at least one stage had been completed on one of the executors and reported as such to the original Driver before it failed? Does the storage level used in persist() make a difference here? Will the new Spark Driver know that the executor had completed that stage? Would Tachyon help out in this scenario?

  3. Does a failback process get triggered if the host machine of the original Spark Application Master's YARN container regains network connectivity? I guess this behaviour can be controlled from YARN, but I don't know what the default is when deploying Spark in cluster mode.

I'd really appreciate it if you could point me to some documents / web pages where the architecture of Spark in yarn-cluster mode and the failover process are explored in detail.

MiguelPeralvo

1 Answer


We just started running on YARN, so I don't know much. But I'm almost certain we had no automatic failover at the driver's level. (We implemented some on our own.)

I would not expect there to be any default failover solution for the driver. You (the driver author) are the only one who knows how to health-check your application. And the state that lives in the driver is not something that can be automatically serialized. When a SparkContext is destroyed, all the RDDs created in it are lost, because they are meaningless without the running application.

What you can do

The recovery strategy we have implemented is very simple. After every costly Spark operation we make a manual checkpoint. We save the RDD to disk (think saveAsTextFile) and load it back right away. This erases the lineage of the RDD, so it will be reloaded rather than recalculated if a partition is lost.
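
For illustration, a minimal sketch of that save-and-reload pattern, assuming an RDD of strings, a SparkContext called sc and an HDFS path of your choosing (the helper name is made up, not part of the Spark API):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Save the RDD to stable storage and immediately reload it.
    // The reloaded RDD's lineage starts at the file, not at the original computation.
    def manualCheckpoint(sc: SparkContext, rdd: RDD[String], path: String): RDD[String] = {
      rdd.saveAsTextFile(path)  // persist the data outside the running application
      sc.textFile(path)         // reload; losing a partition now means re-reading the file
    }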

We also store what we have done and the file name. So if the driver restarts, it can pick up where it left off, at the granularity of such operations.
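
To make "pick up where it left off" concrete, here is an illustrative sketch only (the Step type, the runOrLoad helper and the directory layout are hypothetical, not our actual implementation): each operation's output goes to a well-known path, and on (re)start we check whether that path already exists before computing anything.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // One named operation in the pipeline: a function from input RDD to output RDD.
    case class Step(name: String, compute: RDD[String] => RDD[String])

    def runOrLoad(sc: SparkContext, input: RDD[String], step: Step, baseDir: String): RDD[String] = {
      val outputPath = s"$baseDir/${step.name}"
      val fs = FileSystem.get(sc.hadoopConfiguration)
      if (fs.exists(new Path(outputPath))) {
        // Already computed before a restart: just load the saved output.
        sc.textFile(outputPath)
      } else {
        // Not on disk yet: run the computation, save it, and reload to cut the lineage.
        step.compute(input).saveAsTextFile(outputPath)
        sc.textFile(outputPath)
      }
    }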

Daniel Darabos
  • Hi Daniel, what you said makes sense. But getting my hands on some documentation/resources about the architecture around: How the driver runs in yarn-cluster, its failover process in that context (whatever is provided out of the box), how the lineage gets persisted, if Tachyon or persist(DISK_ONLY) provide some kind of checkpoint that can be leveraged, what happens with orphan stages and executors after the original Driver dies, ... would definitely help me out in the process of designing a more resilient custom failover process for the driver. Thanks! – MiguelPeralvo Jan 20 '15 at 13:42
  • Sorry, I don't know of such documentation. The Spark source code is fairly readable. I've added a description of our (really simple) solution to the answer. Hope it helps! – Daniel Darabos Jan 20 '15 at 16:38
  • Fair enough. Thank you very much, Daniel. – MiguelPeralvo Jan 20 '15 at 17:28
  • Hi Daniel. Regarding "if the driver restarts, it can pick up where it left off, at the granularity of such operations", I don't quite understand this. If nothing ever happens to the driver, then after rdd_a is checkpointed to HDFS, every time rdd_a is requested it is read from HDFS rather than re-computed. However, if the driver does fail and a new driver starts up afterwards, the new driver doesn't know that rdd_a has already been checkpointed, even though there is a checkpoint file on HDFS. Can you explain a little how the new driver picks up, please? :) – Kurtt.Lin Feb 10 '15 at 04:05
  • It's too complex to explain in depth here, I'm afraid. The outline is that we have a list of operations (built by the user). Each of these has code to run the calculation on its input RDDs and produce output RDDs. When we need the result, we check for each operation if its output is already on disk. If it's on disk we load it, if it's not on disk, we run the code (then save the output to disk). Either way we get the output RDD and go to the next operation. Hope this makes some sense! – Daniel Darabos Feb 10 '15 at 08:42