An RDD is inherently fault-tolerant thanks to its lineage. But if an application has hundreds of operations, reconstructing lost data by replaying all of them becomes expensive. Is there a way to store the intermediate data? I understand there are persist()/cache() options to hold the RDDs, but are they good enough for intermediate data? Would checkpointing be an option at all? Also, is there a way to specify the storage level (MEMORY, DISK, etc.) when checkpointing an RDD?
-
I think `rdd.checkpoint` is what you need – Ton Torres May 06 '16 at 01:06
-
@TonTorres can we specify storage level when checkpointing? – spark_dream May 06 '16 at 01:35
-
From what I know, no. `checkpoint` "will be saved to a file inside the checkpoint directory set with `SparkContext#setCheckpointDir`" - this will usually be something like HDFS or S3. – Ton Torres May 06 '16 at 01:49
1 Answer
While cache() and persist() are generic, checkpoint is specific to streaming.
caching - caching may happen in memory or on disk
rdd.cache()
persist - you can choose where to persist your data, in memory and/or on disk
rdd.persist(StorageLevel.MEMORY_AND_DISK)
checkpoint - you need to specify a directory where the data will be saved (in reliable storage like HDFS/S3)
val ssc = new StreamingContext(...) // new context
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
There is a significant difference between cache/persist and checkpoint.
Cache/persist materializes the RDD and keeps it in memory and/or on disk. But the lineage of the RDD (that is, the sequence of operations that generated it) is remembered, so if there are node failures and parts of the cached RDD are lost, they can be regenerated.
However, checkpoint saves the RDD to an HDFS file AND actually FORGETS the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault-tolerant through replication).
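For reference, the checkpoint mechanism can also be invoked on a plain RDD via `SparkContext#setCheckpointDir` and `RDD#checkpoint`. A minimal sketch, assuming a local Spark setup and a writable checkpoint directory (the app name, master, and path here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo extends App {
  val conf = new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]")
  val sc   = new SparkContext(conf)

  // In production this should point at reliable storage (HDFS/S3);
  // a local path is used here only for illustration.
  sc.setCheckpointDir("/tmp/spark-checkpoints")

  val rdd = sc.parallelize(1 to 1000).map(_ * 2)
  rdd.checkpoint() // marks the RDD for checkpointing; nothing is written yet
  rdd.count()      // the first action triggers the save; the lineage is then truncated

  sc.stop()
}
```

Note that checkpoint() must be called before the first action on the RDD, and Spark recomputes the RDD once in order to write it, so calling cache() beforehand avoids that double computation.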
-
Thanks for the very detailed response. If checkpointing is only for streaming then is there a way to store intermediate data in normal RDDs? Or is persist the only option? – spark_dream May 06 '16 at 03:18
-
persist gives you the option to store your intermediate data on disk by specifying a storage level, e.g. rdd.persist(StorageLevel.MEMORY_AND_DISK), which saves data in memory as well as on disk. You can check the other options in the Spark docs - http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence – infiniti May 06 '16 at 03:37
-
_checkpoint is something which is specific to streaming_ - it is not. It also covers different concepts in Spark (data checkpointing, metadata checkpointing, local checkpoints and so on). – zero323 May 06 '16 at 11:33
-
If you can elaborate and write an answer on how to checkpoint non-streaming RDDs, it will help everyone – infiniti May 06 '16 at 15:03