I save an RDD with saveAsObjectFile, so the temporary files are distributed across the driver and executors. At the end of the program I want to remove all of these files. How can I remove them?

user1803467
  • Possible duplicate http://stackoverflow.com/questions/30093676/apache-spark-does-not-delete-temporary-directories?rq=1 – vvg Dec 08 '16 at 11:29
  • Thank you very much. But that question is mainly about the temporary files created by Spark itself; my files are created by my application code. – user1803467 Dec 08 '16 at 11:43

1 Answer


There's no built-in support in Spark for deleting the data it has saved. However, you can use foreachPartition on the original RDD to run arbitrary code on each partition, which means it will run at least once on each executor that actually saved some data.

So, if that code deletes the folder you saved into (and you make sure it won't fail when run more than once on the same executor, since a single executor can hold multiple partitions), you get what you need.

For example, using Apache Commons:

// save
rdd.saveAsObjectFile("/my/path")

// use data...

// before shutting down - iterate over saved RDD's partitions and delete folder:
import java.io.File
import org.apache.commons.io.FileUtils

rdd.foreachPartition { _ =>
  // deleteDirectory is a no-op if the directory does not exist,
  // so it's safe to call more than once on the same executor
  FileUtils.deleteDirectory(new File("/my/path"))
}

EDIT: note that this is a bit hacky and might not be 100% bullet-proof: for example, if one of the executors crashes during the application's execution, its partitions might be recomputed on other executors, and the data left on the crashed executor's disk won't be deleted.
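As a side note (not part of the original answer): if the path points at HDFS or another Hadoop-supported filesystem rather than each executor's local disk, the data lives under one logical directory and a single driver-side delete via the Hadoop FileSystem API may be simpler. A minimal sketch, assuming a SparkContext named sc and the same example path:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the filesystem the application is configured to use
// (HDFS, local, etc.) and delete the directory recursively.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/my/path"), true) // recursive = true; returns false if the path didn't exist
```

The foreachPartition trick in the answer is mainly needed when each executor wrote to its own local filesystem, so no single machine can see all the files.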

Tzach Zohar
  • thanks a lot, do I need to repartition this RDD so the number of partitions equals the number of my Spark executors? Otherwise, if I had set the system's parallelism to a large number, deleting would take too much time on an executor. – user1803467 Dec 08 '16 at 13:37
  • If indeed the number of partitions is extremely large, repartitioning might be helpful, but since this is a quick operation (for most partitions it will just check if file exists) I'd try it as-is and optimize only if necessary. – Tzach Zohar Dec 08 '16 at 13:41