
We are using Spark + Java in our project, and the Hadoop distribution being used is MapR.

In our Spark jobs we persist data (at disk level).
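
Roughly what the persistence step looks like (a simplified sketch, not our actual job; the dataset and the DISK_ONLY level are purely illustrative):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistToDiskSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("persist-to-disk-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Illustrative input; the real job reads from MapR-FS.
        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Persist at disk level: the blocks are written under spark.local.dir,
        // which defaults to /tmp on the nodes.
        JavaRDD<Integer> persisted = data.persist(StorageLevel.DISK_ONLY());
        System.out.println("count = " + persisted.count());

        // Unpersisting and stopping the context releases the persisted blocks,
        // but temp directories are still left under /tmp after the job exits.
        persisted.unpersist();
        sc.stop();
    }
}
```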

After the job completes, there is a lot of temp data left inside the /tmp/ folder. How can we ensure that the /tmp/ folder (temp data) is emptied after the job execution completes?

I found a related question: Apache Spark does not delete temporary directories

But I am not sure how to set the following properties:

  • spark.worker.cleanup.enabled

  • spark.worker.cleanup.interval

  • spark.worker.cleanup.appDataTtl

Also, where should these properties be set: 1. in code, or 2. in the Spark configuration?

We are running the job in cluster mode (with master yarn), using the spark-submit command.
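
For reference, properties can normally be set either in code on the SparkConf (before the context is created) or outside the code, via --conf on the spark-submit command line or in conf/spark-defaults.conf. Below is a minimal sketch of the in-code option with illustrative values; whether these particular worker-cleanup keys actually take effect when set per application is exactly what I am unsure about:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CleanupConfigSketch {
    public static void main(String[] args) {
        // Set properties in code, on the SparkConf, before creating the context.
        // The values below are illustrative only.
        SparkConf conf = new SparkConf()
                .setAppName("cleanup-config-sketch")
                .set("spark.worker.cleanup.enabled", "true")
                .set("spark.worker.cleanup.interval", "1800")       // seconds
                .set("spark.worker.cleanup.appDataTtl", "172800");  // 48 hours

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}
```

The other option would be passing the same keys on the command line, e.g. spark-submit --conf spark.worker.cleanup.enabled=true ..., or putting them in conf/spark-defaults.conf.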

Thanks Anuj

Anuj Mehra

1 Answer

  1. Create a backup of the spark-env.sh file. Open the file in a text editor (e.g. vi) and locate "SPARK_WORKER_OPTS"

  2. Immediately below this line, add or update: SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=172800"

  3. This enables cleanup of the application work directories (which hold each application's logs and jars), retaining them for no more than 48 hours, with the default check interval of every 30 minutes.

Restart Spark and done!
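
For a standalone-mode installation, the resulting fragment in conf/spark-env.sh might look roughly like the following (the interval setting is optional; 1800 seconds, i.e. 30 minutes, is the default check interval):

```bash
# conf/spark-env.sh (standalone worker; back the file up before editing)
# Periodically clean application work directories, keep them for at most
# 48 hours, and check every 30 minutes (the default interval).
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=172800 -Dspark.worker.cleanup.interval=1800"
```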

Amit Kumar
  • But this is only for standalone mode; we are running the jobs in cluster mode. As per the Spark documentation: spark.worker.cleanup.enabled (default false): Enable periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up. spark.worker.cleanup.appDataTtl (default 604800, i.e. 7 days, 7 * 24 * 3600): The number of seconds to retain application work directories on each worker. This is a Time To Live and should depend on the amount of available disk space you have. – Anuj Mehra Jan 21 '18 at 05:01
  • Also this is only for cleaning logs. But we are looking to clean the temp directory. – Anuj Mehra Jan 21 '18 at 05:03
  • Have you found any solution for this? – MsCurious Apr 28 '22 at 04:32