
I understand that localCheckpoint removes the history (lineage) necessary to rebuild the RDD, while cache saves the current state of the RDD so it does not need to be rebuilt.

However, I am confused about a few things. If I localCheckpoint an RDD and need it later in my code, I often get an exception saying the partition can no longer be found.

I looked at the Storage tab in the Spark UI, and it says that only a small fraction of the RDD was saved, e.g. 17%.

So I read more and realized that Spark will discard old RDDs. Is there a way for Spark to keep them forever?

Also, if I used cache instead of localCheckpoint, would the problem be solved? Would it just cost time, since Spark would have to recompute the evicted partitions?

Overall, I just want to keep an RDD in memory for a large part of my job so that I can merge it back in at the very end, but by the time I get there, Spark has removed it. How do I solve that?

Does doing localCheckpoint.cache or cache.localCheckpoint do anything? Or is one or the other enough?

Wonay
  • Possible duplicate of [What is the difference between spark checkpoint and persist to a disk](https://stackoverflow.com/questions/35127720/what-is-the-difference-between-spark-checkpoint-and-persist-to-a-disk) – zero323 Oct 04 '18 at 17:08

2 Answers


Is there a reason you need to use localCheckpoint vs checkpoint? When using localCheckpoint you're truncating the lineage without replicating the data, which is faster but much less reliable; this may be where you're having trouble.

General differences in where they are saved:

cache saves to memory (spilling to disk if too large for memory), while checkpoint saves directly to disk. cached and persisted data can be evicted if memory fills up (whether by you or by someone else working on the same cluster), and will be cleared if your cluster is terminated or restarted. checkpoint persists to HDFS or local storage, and is only deleted if done manually. Each has a different purpose.

More details (highly recommend reading):

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md

Does doing localCheckpoint.cache or cache.localCheckpoint do anything? Or is one or the other enough?

cache before you checkpoint. checkpoint runs as its own job, so if the RDD is cached, the checkpoint job will pull the partitions from the cache instead of recomputing them.

ash_huddles
    What about `persist(Disk)` vs `checkpoint` ? `checkpoint` and `localCheckpoint` are both removing the history, right ? So is (`checkpoint` == `persist(Disk)` + remove history) and (`localCheckpoint` == `persist(memory)` + remove history) ? – Wonay Oct 04 '18 at 19:50

Set spark.dynamicAllocation.cachedExecutorIdleTimeout to a high value if you want to keep an RDD in memory for a long part of your job. With dynamic allocation enabled, executors holding cached blocks are otherwise released after this idle timeout, taking their cached partitions with them.
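For example (a hypothetical spark-submit invocation; the 4h value and job name are illustrative):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=4h \
  my_job.py
```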

Mazin Ibrahim