Why does Spark create multiple CSV files when saving a DataFrame in CSV format?

Question

I want to understand how Spark determines the number of CSV files it creates when saving a DataFrame as CSV. Does the number of partitions affect this number? Why are some empty files created? My code looks like this:

dataframe.coalesce(numPartitions).write
   .format("com.databricks.spark.csv")
   .option("delimiter", "|")
   .option("header", "true")
   .mode("overwrite")
   .save("outputpath")

Possible duplicate of [How to write to CSV in Spark](https://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark) — David, Mar 28 '18 at 18:28

score 4 · Accepted Answer · answered Mar 28 '18 at 18:49

Spark writes multiple files when you save in CSV or any other format because your DataFrame is split into multiple partitions. If the DataFrame has n partitions, you get n files in the output directory.

Does the number of partitions affect this number?

Yes, the number of partitions is equal to the number of files. When saving a DataFrame/RDD, each partition is written as a single file.
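To check this yourself, here is a minimal sketch that compares the partition count with the number of part files written. It assumes Spark 2.x or later (where the built-in `csv` format replaces `com.databricks.spark.csv`), a local master, and a local filesystem path; `PartitionFileDemo`, `countPartFiles`, `writeAndCount` and the `/tmp` paths are made-up names for illustration.

```scala
import java.io.File
import org.apache.spark.sql.{DataFrame, SparkSession}

object PartitionFileDemo {
  // Count the part-*.csv data files in a local output directory
  // (ignores _SUCCESS and the hidden .crc checksum files).
  def countPartFiles(dir: String): Int =
    Option(new File(dir).listFiles())
      .getOrElse(Array.empty[File])
      .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))

  // Write df as pipe-delimited CSV and return (partition count, part files written).
  def writeAndCount(df: DataFrame, dir: String): (Int, Int) = {
    val partitions = df.rdd.getNumPartitions
    df.write
      .format("csv")
      .option("delimiter", "|")
      .option("header", "true")
      .mode("overwrite")
      .save(dir)
    (partitions, countPartFiles(dir))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-file-demo")
      .master("local[2]") // assumption: local run, just for the demo
      .getOrCreate()

    // 1000 rows spread over 4 partitions.
    val df = spark.range(0, 1000).toDF("id").repartition(4)
    println(writeAndCount(df, "/tmp/partition-file-demo"))

    // coalesce(1) merges everything into one partition before the write.
    println(writeAndCount(df.coalesce(1), "/tmp/partition-file-demo-single"))

    spark.stop()
  }
}
```

Note that `coalesce` can only reduce the number of partitions; if you need more partitions than you currently have, use `repartition` instead.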

Why are some empty files created?

Not every partition necessarily contains data. A partition with no rows is still written out, so it ends up as an empty file.
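If you want to see which partitions are empty before writing, a rough sketch like the following counts the rows in each partition. It assumes a local SparkSession; `EmptyPartitionDemo` and `rowsPerPartition` are made-up names for illustration. It only inspects the in-memory partitions; whether an empty partition actually shows up as a file on disk may depend on your Spark version and output format.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyPartitionDemo {
  // Number of rows in each partition, in partition order.
  def rowsPerPartition(df: DataFrame): Array[Int] =
    df.rdd.mapPartitions(rows => Iterator(rows.size)).collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("empty-partition-demo")
      .master("local[2]") // assumption: local run, just for the demo
      .getOrCreate()

    // 3 rows spread over 8 partitions: most partitions must be empty.
    val df = spark.range(0, 3).toDF("id").repartition(8)
    val sizes = rowsPerPartition(df)

    println(sizes.mkString(", "))
    println(s"empty partitions: ${sizes.count(_ == 0)}")

    spark.stop()
  }
}
```

If many of the counts are zero, reducing the partition count with `coalesce` before the write is one way to cut down on empty output files.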

Hope this helps!
