I save an RDD with saveAsObjectFile, so the temporary files are distributed across the driver and executors. At the end of the program I want to remove all of these files. How can I remove them?

user1803467
  • Possible duplicate http://stackoverflow.com/questions/30093676/apache-spark-does-not-delete-temporary-directories?rq=1 – vvg Dec 08 '16 at 11:29
  • Thank you very much. But that question is mainly about the temporary files created by Spark itself; my files are created by my application code. – user1803467 Dec 08 '16 at 11:43

1 Answer


There's no built-in support in Spark for deleting the data it has saved. However, you can use foreachPartition on the original RDD to run arbitrary code on each partition, which means it will run at least once on each executor that actually saved some data.

So, if that code deletes the folder you saved into (and you make sure it won't fail when run more than once on the same executor, since a single executor can hold multiple partitions), you get what you need.

For example, using Apache Commons:

// save
rdd.saveAsObjectFile("/my/path")

// use data...

// before shutting down - iterate over saved RDD's partitions and delete folder:
import java.io.File
import org.apache.commons.io.FileUtils

rdd.foreachPartition { _ =>
  // deleteDirectory is a no-op if the directory does not exist,
  // so it's safe to call more than once on the same executor
  FileUtils.deleteDirectory(new File("/my/path"))
}

EDIT: note that this is a bit hacky and might not be 100% bullet-proof: for example, if one of the executors crashes during the application's execution, its partitions might be recomputed on other executors, and the data left on the crashed executor's disk won't be deleted.
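As a side note (not part of the original answer): if the path points at HDFS or another Hadoop-supported filesystem rather than each executor's local disk, the data lives under one logical directory and a single driver-side delete via the Hadoop FileSystem API may be simpler. A minimal sketch, assuming a SparkContext named sc and the same example path:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the filesystem the application is configured to use
// (HDFS, local, etc.) and delete the directory recursively.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/my/path"), true) // recursive = true; returns false if the path didn't exist
```

The foreachPartition trick in the answer is mainly needed when each executor wrote to its own local filesystem, so no single machine can see all the files.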

Tzach Zohar
  • thanks a lot, do I need to repartition this RDD so the number of partitions equals the number of my Spark executors? Otherwise, if I had set the system's parallelism to a large number, deleting would take too much time on an executor. – user1803467 Dec 08 '16 at 13:37
  • If indeed the number of partitions is extremely large, repartitioning might be helpful, but since this is a quick operation (for most partitions it will just check if file exists) I'd try it as-is and optimize only if necessary. – Tzach Zohar Dec 08 '16 at 13:41