
Let me elaborate on my question:

I am using a cluster that contains a master node and 3 worker nodes; my master node is the one running the SparkContext.

I have saved my RDD to disk using the storage level "DISK_ONLY".

When I run my Spark script, it saves some RDD blocks to the hard disks of the worker nodes. Now, when my master machine goes down, the SparkContext running on it goes down with it, and all the DAG information is lost.

I then have to restart my master node to bring the SparkContext up and running again.

The question is: will I be able to get back all the saved RDDs after this bounce (restarting the master node and the SparkContext daemon), given that everything has been restarted?
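For reference, a minimal sketch of the setup described above (Scala; the input path, transformation, and app name are placeholders I made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object DiskOnlyExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("disk-only-example").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical input and transformation, just to have something to cache.
        val rdd = sc.textFile("hdfs:///data/input.txt").map(_.toUpperCase)

        // DISK_ONLY writes the computed blocks to the local disks of the
        // worker nodes, not to a fault-tolerant store like HDFS.
        rdd.persist(StorageLevel.DISK_ONLY)

        println(rdd.count()) // first action materialises the cached blocks

        spark.stop() // once the SparkContext is gone, so is the block metadata
      }
    }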

thebluephantom
intellect_dp

3 Answers


I don't think there is currently a way to restore cached RDDs after shutting down the SparkContext. The component that puts and gets RDD blocks is Spark's BlockManager. It, in turn, uses another component, BlockInfoManager, to keep track of RDD block info. When the BlockManager on a worker node shuts down, it clears the resources it was using. Among them is the BlockInfoManager, which holds the HashMap containing the RDD block info. Because this map is also cleared during cleanup, the next time a BlockManager is instantiated it has no record of any RDD blocks saved on that worker, so it treats those blocks as uncomputed.
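As a rough illustration of this, you can ask the driver what it still knows about cached RDDs; after a SparkContext restart both calls below come back empty, which matches the empty BlockInfoManager map described above (a sketch, assuming a live SparkContext named sc):

    // RDDs registered via persist()/cache() in the *current* SparkContext.
    val persisted = sc.getPersistentRDDs
    println(s"persisted RDDs: ${persisted.keys.mkString(", ")}")

    // Storage status of those RDDs: how many partitions are actually cached.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
    }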

mkhan

According to @intellect_dp's explanation, if you are using a cluster manager, for example Apache Mesos or Hadoop YARN, then you need to specify which deploy mode you want to go with: "cluster mode" or "client mode".

Deploy mode distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
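For example, with YARN the two modes are selected at submit time (the class and jar names below are placeholders):

    # Cluster mode: the driver (and its SparkContext) runs inside a YARN container,
    # so the cluster manager can re-attempt it if the node it is on fails.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyJob my-job.jar

    # Client mode: the driver runs on the submitting machine; if that machine
    # goes down, the application and its DAG go down with it.
    spark-submit --master yarn --deploy-mode client \
      --class com.example.MyJob my-job.jar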

intellect_dp
  • I am using YARN as the cluster management system, with cluster deploy mode, but does that mean I need to use client mode to retain my RDDs? – intellect_dp Apr 07 '19 at 14:05
  • My point is you need a solution where the driver node and the SparkContext node are different nodes, and you keep the SparkContext node fault tolerant with the help of YARN, in case the SparkContext node goes down – deepika patel Apr 07 '19 at 14:09
  • Not sure how the question is answered with this explanation. – thebluephantom Apr 07 '19 at 14:49
  • @thebluephantom - My explanation of the question leans towards an answer where one needs to keep the SparkContext node fault tolerant, so that next time, if we want to retain the RDDs, the SparkContext's DAG will help us retain them all – deepika patel Apr 07 '19 at 17:57
  • Other posts on SO point to it being different. I have to say, I am going to research this; it is not an easy topic to understand. I am not sure persist helps in this way, another topic for next weekend. Of course, with ZooKeeper you can have a standby Master (see the sketch after these comments). – thebluephantom Apr 07 '19 at 18:06
  • Yes, I admit this is not that easy; it will certainly involve experiment and research – deepika patel Apr 07 '19 at 18:10
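For completeness, the standby-Master setup mentioned in the comments above applies to Spark standalone mode and is configured roughly like this (a sketch; the ZooKeeper quorum address and directory are placeholders):

    # spark-env.sh on each Spark master
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"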

The short answer is NO. Best to fail over your Master.

Alternatively, or as a complement, you could split up your jobs using a scheduler and use the Spark bucketBy approach (a sketch follows).
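A rough sketch of the bucketBy idea (Scala; the input path, table, and column names are placeholders): the bucketed table lives in the warehouse/metastore rather than in worker-local cache, so a later or restarted job can read it back.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketby-example")
      .enableHiveSupport()   // a persistent metastore lets later applications see the table
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///data/events")   // hypothetical input

    // Write once as a bucketed, sorted table; later jobs read the table back
    // instead of relying on RDD blocks cached on worker disks.
    df.write
      .bucketBy(8, "user_id")
      .sortBy("user_id")
      .saveAsTable("events_bucketed")

    // A subsequent (or restarted) job can simply do:
    val reloaded = spark.table("events_bucketed")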

thebluephantom