I understand that localCheckpoint removes the history (lineage) necessary to rebuild the RDD, and that cache saves the current state of the RDD so it does not need to be rebuilt.
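To make my mental model concrete, here is a minimal sketch of the two calls as I understand them (sc is an existing SparkContext; the data is just a placeholder):

```scala
import org.apache.spark.rdd.RDD

val base: RDD[Int] = sc.parallelize(1 to 1000000)

// cache: keeps the computed partitions around; the lineage stays
// intact, so an evicted partition can always be recomputed
val cached = base.map(_ * 2).cache()

// localCheckpoint: marks the RDD so that, once it is materialized,
// its lineage is truncated and only the stored partitions remain
val checkpointed = base.map(_ * 2).localCheckpoint()
checkpointed.count() // first action materializes the checkpoint
```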
However, I am confused about a few things. If I call localCheckpoint and then need the RDD later in my code, I often get an exception saying that a partition can no longer be found.
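This is roughly the pattern that triggers it, simplified (parseLine, the input path, and otherRdd are hypothetical stand-ins for my real job):

```scala
val rdd = sc.textFile("hdfs://...").map(parseLine)
rdd.localCheckpoint()
rdd.count() // materializes the RDD; the lineage is now truncated

// ... many unrelated stages run here; executors may drop blocks ...

// much later, this fails with an exception along the lines of
// "Checkpoint block rdd_X_Y not found!" when a partition was lost
rdd.union(otherRdd).count()
```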
I looked at the Storage tab in the Spark UI, and it says that only a small fraction of the RDD was saved, e.g. 17%. So I read more and realized that Spark will discard old RDDs. Is there a way to make Spark keep the RDD forever?
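For reference, this is the kind of number I mean; I believe sc.getRDDStorageInfo reports the same thing as the Storage tab (a sketch):

```scala
// prints, for each persisted RDD, how many partitions are actually
// cached versus how many partitions the RDD has in total
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} of " +
    s"${info.numPartitions} partitions cached")
}
```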
Also, if I used cache instead of localCheckpoint, would the problem be solved? Would it just cost extra time, since Spark would have to recompute the evicted partitions?
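In code, the alternative I am considering (same hypothetical names as above):

```scala
val cachedRdd = sc.textFile("hdfs://...").map(parseLine).cache()
cachedRdd.count() // materializes and caches the partitions

// later: if some blocks were evicted in the meantime, Spark can
// recompute them from the lineage instead of failing, just at the
// cost of re-running the transformations
cachedRdd.union(otherRdd).count()
```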
Overall, I just want to keep an RDD in memory for a large part of my job so I can merge it back in at the very end, but by the time I get there, Spark has removed it. How do I solve that?
Does chaining localCheckpoint.cache or cache.localCheckpoint do anything, or is one or the other enough?
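These are the two orderings I mean (as far as I can tell, both compile, since cache and localCheckpoint each return the RDD itself, but I don't know whether either behaves differently):

```scala
val a = rdd.cache().localCheckpoint()
val b = rdd.localCheckpoint().cache()
```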