I'm running a job on a two-machine Spark 2.1.0 cluster.

I'm trying to save a Dataframe to a CSV file (or multiple ones, it doesn't matter). When I use:

df.write
  .options(options)
  .csv(finalPath)

It successfully saves the data into CSV files, one per partition. On one of my machines it creates the .csv files as part-XXXX files directly inside the directory I specified, which is great. But on the other machine it creates a _temporary/0/ subdirectory inside that directory, and the files there are in the format task_XXXX, and this behaviour is less great.

Why does that happen? And is there a way to have the output written like on the first machine, without creating the _temporary/0/ subdirectories?
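To illustrate what I mean by "stranded" output, here is a small standalone Python sketch (no Spark required; the directory layout is fabricated to mimic the symptom, so the file names are just examples). Committed part-XXXX files sit at the top level of the output directory, while anything still under _temporary/ was never moved into place by the output committer:

```python
import pathlib
import tempfile

def audit_output(output_dir):
    """Separate committed part files from leftover _temporary files.

    After a fully committed Spark write, only part-XXXX files (plus
    _SUCCESS) should remain; anything still under _temporary/ was
    never promoted by the output committer.
    """
    root = pathlib.Path(output_dir)
    committed, uncommitted = [], []
    for p in root.rglob("*"):
        if not p.is_file():
            continue
        if "_temporary" in p.relative_to(root).parts:
            uncommitted.append(p)
        elif p.name.startswith("part-"):
            committed.append(p)
    return committed, uncommitted

# Fabricated layout mirroring the symptom described above.
base = pathlib.Path(tempfile.mkdtemp())
(base / "part-0000").write_text("row1\n")
tmp = base / "_temporary" / "0"
tmp.mkdir(parents=True)
(tmp / "task_0001").write_text("row2\n")

committed, uncommitted = audit_output(base)
print(len(committed), len(uncommitted))  # → 1 1
```

Counting only the top-level part files undercounts the rows exactly by whatever is stranded under _temporary/, which matches the record-count discrepancy I describe in the comments below.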

Thanks in advance :)

sid802
  • It is my understanding that Spark does this when tasks spill to disk for things such as checkpointing and shuffle operations. Spark is supposed to clean those files up when the run is complete. Check out this question and its answers for more information: https://stackoverflow.com/questions/30093676/apache-spark-does-not-delete-temporary-directories – Jeremy Jul 05 '17 at 13:26
  • But I don't want them cleaned up; they contain parts of my dataframe that were partitioned on that machine. And my issue is different: the question you referred to describes directories being created in Spark's temporary folder, while mine is created as a _temporary directory inside the CSV output directory and shouldn't actually be temporary, it should be my permanent output – sid802 Jul 05 '17 at 14:01
  • I guess that I am not understanding your issue. The number of records in your dataframe before it writes is not the same as the number of records in the CSV files that are in the correct directory? Don't count the records in the temporary directory. – Jeremy Jul 05 '17 at 15:16
  • No, the number is different. The count is only correct if I also include the records inside the _temporary directory – sid802 Jul 06 '17 at 06:16
  • I think to approach this question, more context information is required: How are you running this job, what filesystem are you using to save to, does the job finish correctly? – Rick Moritz Jul 06 '17 at 13:29
  • @sid802 Are you seeing any errors from the FileOutputCommitter? Specifically the [commitTask](https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.java#L442) method? – Jeremy Jul 06 '17 at 16:32

0 Answers