
I am running a Spark application in 'local' mode. It checkpoints correctly to the directory defined in the checkpointFolder config. However, I am seeing two issues that are causing disk space problems.

1) Since multiple users run the application, the checkpoint folder on the server is created by whichever user executes it first, and other users' runs then fail due to an OS permissions issue. Is there a way to provide a relative path in checkpointFolder, for example checkpointFolder=~/spark/checkpoint?

2) I have set spark.worker.cleanup.enabled=true to clean up the checkpoint folder after the run, but I don't see that happening. Is there a way to clean it up from within the app, instead of resorting to a cron job?

Kashyapgv

2 Answers


I hope the following helps:

1) You can create a unique checkpoint folder for each run, e.g. /tmp/spark_checkpoint_1578032476801 with a timestamp suffix, so concurrent users never share a directory; see the sketch below.
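
A minimal sketch of that idea, which also sidesteps the permissions problem from question 1 by putting the folder under the calling user's home directory (the spark/checkpoint layout is my assumption, not a Spark convention; note that Spark does not expand ~, so the home directory is resolved explicitly):

    import org.apache.spark.SparkContext

    // Build a per-user, per-run checkpoint path and register it with Spark.
    // `user.home` avoids the shared-folder permissions clash; the timestamp
    // keeps concurrent runs by the same user apart.
    def setUniqueCheckpointDir(sc: SparkContext): String = {
      val home = System.getProperty("user.home")
      val dir  = s"$home/spark/checkpoint/spark_checkpoint_${System.currentTimeMillis()}"
      sc.setCheckpointDir(dir)
      dir
    }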

2a) You can simply delete the folder at the end of the app; a sketch follows below.
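
For a local checkpoint folder, a minimal sketch (the function name is mine; pass whatever path you gave to setCheckpointDir):

    import java.io.File
    import scala.reflect.io.Directory

    // Recursively delete the local checkpoint folder once the run is done.
    def deleteLocalCheckpointDir(path: String): Boolean =
      new Directory(new File(path)).deleteRecursively()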

2b) If you checkpoint to HDFS, you can use code like this:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Recursively delete the checkpoint path on whichever filesystem backs fsPath.
    def cleanFS(sc: SparkContext, fsPath: String): Unit = {
      val fs = FileSystem.get(new URI(fsPath), sc.hadoopConfiguration)
      fs.delete(new Path(fsPath), true) // true = recursive delete
    }
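
Call it at the end of the job, e.g. cleanFS(sc, checkpointFolder), before stopping the SparkContext.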
StanislavKo

Check out this answer:

PySpark: fully cleaning checkpoints

I was facing the same issue, and the approach in the linked answer solved it for me.
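
In case the link goes stale: as I understand it, the linked answer hinges on Spark's spark.cleaner.referenceTracking.cleanCheckpoints setting, which lets the ContextCleaner delete checkpoint files once the referencing RDD is garbage-collected. A minimal sketch in Scala (the linked post is PySpark; the app name and path below are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("checkpoint-cleanup-demo")  // illustrative name
      .master("local[*]")
      // ask the ContextCleaner to remove checkpoint files of GC'd RDDs
      .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
      .getOrCreate()

    spark.sparkContext.setCheckpointDir("/tmp/spark_checkpoint")  // illustrative path

    val rdd = spark.sparkContext.parallelize(1 to 100)
    rdd.checkpoint()
    rdd.count()  // an action materializes the checkpoint
    // Once `rdd` becomes unreachable and is GC'd, its checkpoint files
    // are deleted because of the setting above.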

Jatin Chauhan