
Three related questions:

What will happen if one of my executors is lost?

What will happen if my driver is lost?

What will happen in case of a stage failure?

In all of the above cases, are they recoverable? If yes, how do I recover? Is there any option in SparkConf that, when set, can prevent these failures?

Thanks.

Rock

1 Answer


Spark uses job scheduling: the DAGScheduler resubmits failed stages and the task scheduler retries failed tasks, while your cluster manager (Standalone, YARN, Mesos) can re-launch lost executors and re-schedule the failed work.

For example, if you use YARN, try tweaking spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts. You can also track jobs manually through the YARN REST API: https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
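A minimal sketch of setting the Spark-side retry limits from Scala. Note that yarn.resourcemanager.am.max-attempts is a YARN-side property configured in yarn-site.xml, not in SparkConf; the app name and values below are placeholders, and spark.task.maxFailures is included as a related task-retry knob not mentioned above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Allow the YARN ApplicationMaster (and, in cluster mode, the driver)
// to be restarted after a failure. The effective limit is
// min(spark.yarn.maxAppAttempts, yarn.resourcemanager.am.max-attempts).
val conf = new SparkConf()
  .setAppName("resilient-app")            // placeholder name
  .set("spark.yarn.maxAppAttempts", "4")
  .set("spark.task.maxFailures", "8")     // task-level retries before the job fails

val spark = SparkSession.builder().config(conf).getOrCreate()
```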

If you want to recover from logical errors, you can try checkpointing (saving intermediate records to HDFS for later use): https://mallikarjuna_g.gitbooks.io/spark/content/spark-streaming/spark-streaming-checkpointing.html. (For really long and important pipelines, I recommend saving your data to normal files instead of relying on checkpoints.)
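For illustration, here is a minimal sketch of RDD-level checkpointing; the streaming variant linked above works the same way via a checkpoint directory. The HDFS path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()
val sc = spark.sparkContext

// Directory where checkpoint data is written; use a reliable store such as HDFS.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path

val data = sc.parallelize(1 to 1000).map(_ * 2)
data.checkpoint()   // marks the RDD for checkpointing and truncates its lineage
data.count()        // the action triggers materialization and the actual checkpoint
```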

Configuring highly available clusters is a more complex task than tweaking a single setting in SparkConf. You can try to implement different scenarios and come back with more detailed questions. As a first step, you could try running everything on YARN.

Oleg Chirukhin
  • Ok, thanks. But I have a question: does Spark's self-recovery only work (or work better) in cluster mode rather than client mode? And if I turn on dynamic allocation, I guess the chance of losing the driver and executors is lower in cluster mode? Please explain if possible. – Rock Feb 14 '21 at 16:57
  • @Rock Cluster mode is better. Here's some explanation: https://stackoverflow.com/a/40026542/368115 – Oleg Chirukhin Feb 14 '21 at 20:47