
I am using Spark repartition to change the number of partitions in a DataFrame.

While writing the data after repartitioning, I noticed that Parquet files of very different sizes were created.

Here is the code I am using to repartition:

df.repartition(partitionCount).write.mode(SaveMode.Overwrite).parquet("/test")

Most of the partitions are only a few KB in size, while some are around 100 MB, which is the size I want to keep per partition.

Here is a sample:

20.2 K  /test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
20.2 K  /test/part-00011-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
99.9 M  /test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet

Now if I open one of the 20.2 K Parquet files and run a count action, the result is 0. The same count on the 99.9 M file returns a non-zero result.

As per my understanding of repartition on a DataFrame, it does a full shuffle and tries to distribute the rows evenly, so every partition should end up roughly the same size. However, the example above contradicts that.
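For reference, here is a small diagnostic sketch I can run to see how many rows actually land in each partition after the shuffle (it assumes the same df and partitionCount as above, and uses Spark's built-in spark_partition_id function):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Count rows per physical partition after repartitioning;
// empty partitions will simply be missing from the output.
val perPartitionCounts = df
  .repartition(partitionCount)
  .groupBy(spark_partition_id().as("partition"))
  .count()

perPartitionCounts.orderBy("partition").show(partitionCount, truncate = false)
```

If some partition ids show very small or zero counts, that would match the tiny Parquet files I am seeing on disk.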

Could someone please help me here?

Avishek Bhattacharya
