An RDD is inherently fault-tolerant thanks to its lineage. But if an application has hundreds of operations, reconstructing lost data by replaying all of them becomes expensive. Is there a way to store the intermediate data? I understand there are persist()/cache() options to hold the RDDs, but are they good enough for intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the storage level (MEMORY, DISK, etc.) when checkpointing an RDD?
-
I think `rdd.checkpoint` is what you need – Ton Torres May 06 '16 at 01:06
-
@TonTorres can we specify storage level when checkpointing? – spark_dream May 06 '16 at 01:35
-
From what I know, no. `checkpoint` "will be saved to a file inside the checkpoint directory set with `SparkContext#setCheckpointDir`" - this will usually be something like HDFS or S3. – Ton Torres May 06 '16 at 01:49
1 Answer
While cache() and persist() are generic, checkpoint is specific to streaming.
caching - caching may happen in memory or on disk
rdd.cache()
persist - you can choose where to persist your data, in memory and/or on disk
rdd.persist(StorageLevel.MEMORY_AND_DISK)
checkpoint - you need to specify a directory where the data will be saved (in reliable storage like HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and/or on disk. But the lineage of the RDD (that is, the sequence of operations that generated it) is remembered, so if there are node failures and parts of the cached RDD are lost, they can be regenerated.
However, checkpoint saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault-tolerant through replication).
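For reference, the checkpoint mechanism can also be invoked on a plain RDD via `SparkContext#setCheckpointDir` and `RDD#checkpoint`. A minimal sketch, assuming a local Spark setup and a writable checkpoint directory (the app name, master, and path here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo extends App {
  val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  // In production this should point at reliable storage (HDFS/S3);
  // a local path is used here only for illustration.
  sc.setCheckpointDir("/tmp/spark-checkpoints")

  val rdd = sc.parallelize(1 to 1000).map(_ * 2)
  rdd.checkpoint() // marks the RDD for checkpointing; nothing is written yet
  rdd.count()      // the first action triggers the save; the lineage is then truncated

  sc.stop()
}
```

Note that checkpoint() must be called before the first action on the RDD, and Spark recomputes the RDD once in order to write it, so calling cache() beforehand avoids that double computation.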
-
Thanks for the very detailed response. If checkpointing is only for streaming then is there a way to store intermediate data in normal RDDs? Or is persist the only option? – spark_dream May 06 '16 at 03:18
-
persist gives you the option to store your intermediate data on disk by specifying a storage level, e.g. rdd.persist(StorageLevel.MEMORY_AND_DISK), which saves data in memory as well as on disk. You can check the other options in the Spark docs - http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence – infiniti May 06 '16 at 03:37
-
_checkpoint is something which is specific to streaming_ - it is not. It also covers different concepts in Spark (data checkpointing, metadata checkpointing, local checkpoints and so on). – zero323 May 06 '16 at 11:33
-
If you can elaborate and write an answer on how to checkpoint non-streaming RDDs, it will help everyone – infiniti May 06 '16 at 15:03