Pyspark dataframe partition number

Question

I have a DataFrame df that I want to partition by date (a column in df). I have the code below:

df.write.partitionBy('date').mode('overwrite').orc('path')

Under the path above there are a bunch of folders, e.g. date=2018-10-08 etc. But inside the folder date=2018-10-08 there are 5 files. I want to reduce this to only one file inside the date=2018-10-08 folder. How can I do that? I still want the output partitioned by date.

Thank you in advance!

score 2 · Accepted Answer · answered Oct 10 '18 at 01:04

To get 1 file per partition folder, you need to repartition the data by the partition column before writing. This shuffles the data so that all rows for a given date end up in the same DataFrame/RDD partition:

df.repartition('date').write.partitionBy('date').mode('overwrite').orc('path')
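Why this works can be sketched with a toy model in plain Python. This is not Spark code: the helper names `files_per_folder` and `repartition_by_date` are made up for illustration, and the model assumes each write task emits one file for every distinct date it holds, which is roughly how a partitionBy write behaves when no file-size limits are set.

```python
from collections import defaultdict

def files_per_folder(partitions):
    """Count files per date folder, assuming each partition (task)
    writes one file for every distinct date it contains."""
    counts = defaultdict(int)
    for rows in partitions:
        for date in {row["date"] for row in rows}:
            counts[date] += 1
    return dict(counts)

def repartition_by_date(partitions, num_partitions):
    """Hypothetical stand-in for df.repartition('date'):
    route every row to a partition chosen by hashing its date."""
    out = [[] for _ in range(num_partitions)]
    for rows in partitions:
        for row in rows:
            out[hash(row["date"]) % num_partitions].append(row)
    return out

dates = ["2018-10-08", "2018-10-09", "2018-10-10"]

# 5 input partitions, each holding rows for every date
before = [[{"date": d, "value": i} for d in dates] for i in range(5)]
after = repartition_by_date(before, num_partitions=4)

print(files_per_folder(before))  # every date is spread over 5 partitions
print(files_per_folder(after))   # every date lives in exactly one partition
```

In this model the unpartitioned layout produces 5 files per date folder, while hashing on date puts each date in a single partition and so yields a single file per folder.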

Silvio


df.coalesce(1).write.partitionBy('date').mode('overwrite').orc('path'). I think that since we are reducing the number of partitions from higher to lower, coalesce would cause less shuffling of data. What do you think? – vikrant rana Oct 30 '18 at 12:50
Using coalesce(1) will reduce the parallelism of the whole pipeline to a single core. So yes, it will reduce shuffling, but it will hurt performance at the same time. Repartitioning by date in this case ensures that each partition folder ends up with a single file. – Silvio Oct 30 '18 at 14:32
