
I am running a very long-running batch job that generates a lot of OOM exceptions. To mitigate this, I added checkpoint() calls.

Where should I set the checkpoint dir? The location has to be accessible to all the executors. Currently, I am using a GCS bucket. Based on the log files I can see that my code has progressed past several of the checkpoint() calls; however, the bucket is empty.

sparkContext.setCheckpointDir("gs://myBucket/checkpointDir/")

Based on CPU utilization and the log messages, it looks like my job is still running and making progress. Any idea where Spark is writing the checkpoint data?

2022-01-22 18:38:06 WARN  DAGScheduler:69 - Broadcasting large task binary with size 4.9 MiB
2022-01-22 18:47:23 WARN  BlockManagerMasterEndpoint:69 - No more replicas available for broadcast_50_piece0 !
2022-01-22 18:47:23 WARN  BlockManagerMaster:90 - Failed to remove broadcast 50 with removeFromMaster = true - org.apache.spark.SparkException: Could not find BlockManagerEndpoint1.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)

kind regards

Andy

AEDWIP

1 Answer


Did you manually trigger the checkpoint in your code? If not, it won't be triggered automatically. See https://programmer.help/blogs/spark_-correct-use-of-checkpoint-in-spark-and-its-difference-from-cache.html. Note that checkpointing is generally not a way to solve OOM problems in Spark. – Dagang Jan 23 '22 at 22:25

Dagang