I am running a very long-running batch job. It generates a lot of OOM exceptions. To minimize this problem added checkpoints()
Where should I set the checkpoint dir to? The location has to be accessible to all the executors. Currently, I am using a bucket. Based on log files I can see that my code has progressed past several of the checkpoint() calls however the bucket is empty
sparkContext.setCheckpointDir("gs://myBucket/checkpointDir/")
based on CPU utilization and log messages, it looks like my job is still running and making progress after. any idea what the spark where the checkpoint data?
2022-01-22 18:38:06 WARN DAGScheduler:69 - Broadcasting large task binary with size 4.9 MiB
2022-01-22 18:47:23 WARN BlockManagerMasterEndpoint:69 - No more replicas available for broadcast_50_piece0 !
2022-01-22 18:47:23 WARN BlockManagerMaster:90 - Failed to remove broadcast 50 with removeFromMaster = true - org.apache.spark.SparkException: Could not find BlockManagerEndpoint1.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
kind regards
Andy