
I need to write a DataFrame to a CSV file, with the file name changing according to an iteration index (idx):

for (idx <- 1 to 3) {
  // do some operation and generate a df that depends on idx
  ...
  df.coalesce(1).write.csv("/temp/path/file#.csv")
}

The # should vary as idx changes (in other words, the file names should be file1.csv, file2.csv, file3.csv in sequence as the iteration goes). This seems to be a very common problem, but I have not found a clear solution in Scala yet. Thanks!

Guanghua Shu
  • What do you want to iterate over? If you `coalesce`, there will be only *one* file. If you don't `coalesce`, the files will be numbered automatically. – Andrey Tyukin Apr 05 '18 at 21:30
  • @AndreyTyukin Sorry for the confusion. Could you please see if my edits make sense to you now? – Guanghua Shu Apr 05 '18 at 21:46

1 Answer


The classic way would be:

for (idx <- 1 to 3) {
  // do some operation and generate a df that depends on idx
  ...
  df.coalesce(1).write.csv("/temp/path/file_" + idx + ".csv")
}

The fancier, newer way uses string interpolation:

for (idx <- 1 to 3) {
  // do some operation and generate a df that depends on idx
  ...
  df.coalesce(1).write.csv(s"/temp/path/file_${idx}.csv")
}
Andrey Tyukin
  • Thanks Andrey, I think both options work. But both generate a directory with the name "file_1.csv", and inside this directory, there is a csv file that starts with "part" followed by some very long ids. Any idea why it is not generating a file named "file_1.csv" inside "/temp/path/" directory? – Guanghua Shu Apr 05 '18 at 22:34
  • @GuanghuaShu because it uses the same code independently of the number of partitions. If there is more than one partition, then every partition will dump its data into its own `part`-file within the directory, because merging everything into a single file would be too slow in distributed file systems. For the case that there is only one partition, it still does the same, to make the implementation simpler and the behavior more consistent. Take a look at [this question](https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv). – Andrey Tyukin Apr 05 '18 at 22:39
  • I understand why there is only one .csv file. The part I am not sure about is that I assumed what we pass to .csv(s"/temp/path/file_${idx}.csv") should be a file name, not a directory name. It seems that it is only taken as a directory name, and we CANNOT specify the file name. Am I right? – Guanghua Shu Apr 05 '18 at 23:00
  • @GuanghuaShu I'm not aware of a way to specify the filename directly (but it does not mean that there is no way). Everyone else (e.g. [here](https://gist.github.com/dmpetrov/a4a5dc2cc8719619410e37dedde5130e), [here](https://forums.databricks.com/questions/2848/how-do-i-create-a-single-csv-file-from-multiple-pa.html)) seems to move and rename the resulting `part`-files after they have been created. That's at least what I've found. – Andrey Tyukin Apr 05 '18 at 23:07
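The move-and-rename approach mentioned in the last comment could be sketched as below. This is a hypothetical helper (not part of the Spark API): it assumes the output directory sits on a local filesystem and uses `java.nio.file`; on HDFS you would do the equivalent with the Hadoop `FileSystem` API instead.

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.jdk.CollectionConverters._

object CsvRename {
  // After df.coalesce(1).write.csv(outputDir) has produced a directory
  // containing exactly one part-*.csv file, move that file to `target`
  // and remove the leftover directory (with its _SUCCESS marker etc.).
  // Local-filesystem sketch only; name and behavior are assumptions.
  def renameSingleCsv(outputDir: String, target: String): Unit = {
    val dir = Paths.get(outputDir)
    val entries = Files.list(dir).iterator().asScala.toList
    val part = entries
      .find { p =>
        val name = p.getFileName.toString
        name.startsWith("part-") && name.endsWith(".csv")
      }
      .getOrElse(sys.error(s"no part-*.csv found in $outputDir"))
    Files.move(part, Paths.get(target), StandardCopyOption.REPLACE_EXISTING)
    // delete whatever else Spark left behind, then the directory itself
    entries.filter(Files.exists(_)).foreach(Files.delete)
    Files.delete(dir)
  }
}
```

In the loop from the answer, this would be called right after each write, e.g. `CsvRename.renameSingleCsv(s"/temp/path/out_$idx", s"/temp/path/file_$idx.csv")`, so the final files carry the desired names.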