I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? And why are some empty files created? My code looks like this:

dataframe.coalesce(numPartitions).write
   .format("com.databricks.spark.csv")
   .option("delimiter", "|")
   .option("header", "true")
   .mode("overwrite")
   .save("outputpath")
zero323
user3104078

1 Answer

Spark writes multiple files when you save as CSV (or any other format) because your DataFrame has multiple partitions. If the DataFrame has n partitions, you get n files in the output directory.

Does the number of partitions affect this number?

Yes, the number of partitions equals the number of files. When saving a DataFrame/RDD, each partition is written as a single file.
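
If you want to check this yourself, here is a minimal sketch that compares the partition count with the number of part files written. It assumes a local SparkSession on Spark 2.x or later, where the built-in `csv` writer can be used instead of `com.databricks.spark.csv`. The sample data and the output path `/tmp/partition_demo` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object PartitionFileCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-file-count")
      .getOrCreate()

    // Hypothetical sample data: 1000 rows spread over 8 partitions.
    val df = spark.range(0, 1000).toDF("id").repartition(8)

    // coalesce(3) reduces the DataFrame to 3 partitions before writing.
    val out = df.coalesce(3)
    println(s"partitions: ${out.rdd.getNumPartitions}")

    out.write
      .option("delimiter", "|")
      .option("header", "true")
      .mode("overwrite")
      .csv("/tmp/partition_demo") // assumed path, adjust as needed

    // Count the part files in the output directory.
    // _SUCCESS and the hidden .crc files do not match this filter.
    val partFiles = new java.io.File("/tmp/partition_demo")
      .listFiles()
      .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))
    println(s"part files: $partFiles")

    spark.stop()
  }
}
```

With every partition holding data, the two printed numbers should match. If you need exactly one file, `coalesce(1)` before the write is the usual approach, at the cost of writing everything from a single task.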

why are some empty files created?

Not every partition necessarily contains data. A partition with no rows is still written out, which gives you an empty file.
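
To see which partitions are empty before writing, you can count the rows in each one. This is only a sketch with made-up data (3 rows forced into 10 partitions), again assuming a local SparkSession. Whether an empty partition actually ends up as a file on disk can depend on the Spark version and the output format, so treat this as a diagnostic rather than a guarantee.

```scala
import org.apache.spark.sql.SparkSession

object EmptyPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-partitions")
      .getOrCreate()

    // Hypothetical data: only 3 rows in 10 partitions, so at least 7 must be empty.
    val df = spark.range(0, 3).toDF("id").repartition(10)

    // Emit one (partitionIndex, rowCount) pair per partition.
    val counts = df.rdd
      .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
      .collect()

    counts.foreach { case (idx, n) => println(s"partition $idx: $n rows") }
    println(s"empty partitions: ${counts.count(_._2 == 0)}")

    spark.stop()
  }
}
```

If many partitions turn out to be empty, reducing the partition count with `coalesce` before the write should cut down on the empty files.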

Hope this helps!

koiralo