I save an RDD with saveAsObjectFile, so the temporary files are distributed across the driver and executors. At the end of the program I want to remove all of these files. How can I remove them?
-
Possible duplicate http://stackoverflow.com/questions/30093676/apache-spark-does-not-delete-temporary-directories?rq=1 – vvg Dec 08 '16 at 11:29
-
Thank you very much. But that question is mainly concerned with the temporary files created by the Spark system itself. My files are created by my application code. – user1803467 Dec 08 '16 at 11:43
1 Answer
There's no built-in support in Spark for deleting saved data. However, you can use foreachPartition
on the original RDD to run an arbitrary piece of code on each partition, meaning it will run at least once on each executor that actually saved some data.
So, if you run code that deletes the folder you saved into (making sure it won't fail if it runs more than once on the same executor, as a single executor can hold multiple partitions), you'd get what you need.
For example, using Apache Commons:
// save
rdd.saveAsObjectFile("/my/path")

// use data...

// before shutting down - iterate over the saved RDD's partitions
// and delete the folder on whichever executor each partition runs:
import java.io.File
import org.apache.commons.io.FileUtils

rdd.foreachPartition { _ =>
  // deleteDirectory doesn't fail if the directory does not exist,
  // so it's safe to call more than once on the same executor
  FileUtils.deleteDirectory(new File("/my/path"))
}
EDIT: note that this is a bit hacky and might not be 100% bullet-proof: for example, if one of the executors crashes during execution, its partitions may be recomputed on other executors, so the delete code will never run on the crashed executor's node and the files there won't be removed.

-
Thanks a lot. Do I need to repartition this RDD so the number of partitions equals the number of my Spark executors? Otherwise the deletion might take too much time on an executor if I had set the parallelism of the Spark system to a large number. – user1803467 Dec 08 '16 at 13:37
-
If the number of partitions is indeed extremely large, repartitioning might help, but since this is a quick operation (for most partitions it will just check whether the directory exists), I'd try it as-is and optimize only if necessary. – Tzach Zohar Dec 08 '16 at 13:41