I am working on a Spark ML pipeline where we get OOM errors on larger data sets. Before training we were using `cache()`; I swapped this out for `checkpoint()` and our memory requirements went down significantly. However, the docs for `RDD`'s `checkpoint()` say:
> It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
The same guidance is not given for `Dataset`'s `checkpoint()`, which is what I am using. Following the above advice anyway, I found that memory requirements actually increased slightly compared to using `cache()` alone.
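For context, the relevant part of the pipeline looks roughly like the sketch below; the checkpoint directory and the data are placeholders, not our real ones.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pipeline").getOrCreate()

// A reliable checkpoint needs a directory configured first;
// this path is a placeholder.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// Stand-in for our real training Dataset.
val ds = spark.range(0L, 1000000L).toDF("id")

ds.cache()                          // lazy: marks the plan for in-memory caching
val checkpointed = ds.checkpoint()  // eager by default: materializes ds and
                                    // writes it under the checkpoint directory
```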
My expectation was that when we do

```
...
ds.cache()
ds.checkpoint()
...
```
the call to `checkpoint()` forces evaluation of the `Dataset`, which is cached at the same time before being checkpointed. Afterwards, any reference to `ds` would use the cached partitions, and if more memory is required and those partitions are evicted, the checkpointed partitions would be read back rather than re-evaluated. Is this true, or does something different happen under the hood? Ideally I'd like to keep the `Dataset` in memory if possible, but from a memory standpoint there seems to be no benefit whatsoever to the combined cache-and-checkpoint approach over `cache()` alone.
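For what it's worth, this is the kind of quick check I've been running in `spark-shell` to see what each call actually returns (continuing from the sketch above, so `ds` and `checkpointed` already exist and a checkpoint directory is set); the comments reflect what I believe should happen, not verified behaviour:

```scala
ds.explain()                        // plan should now contain an InMemoryRelation
                                    // (the cache marker)
checkpointed.explain()              // plan should be a plain scan over the
                                    // already-computed RDD

println(ds.storageLevel)            // cache level of the original Dataset
println(checkpointed.storageLevel)  // the checkpointed Dataset reports no storage
                                    // level unless it is cached separately
```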