
I am using Spark repartition to change the number of partitions in a DataFrame.

While writing the data after repartitioning, I noticed that Parquet files of very different sizes were created.

Here is the code I am using to repartition:

df.repartition(partitionCount).write.mode(SaveMode.Overwrite).parquet("/test")

Most of the partitions are only a few KB in size, while some are around 100 MB, which is the size I want to keep per partition.

Here is a sample:

20.2 K  /test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
20.2 K  /test/part-00011-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
99.9 M  /test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet

Now if I open one of the 20.2 K Parquet files and run a count action, the result is 0. The same count on the 99.9 M file returns a non-zero result.

As per my understanding of repartition on a DataFrame, it does a full shuffle and tries to distribute the rows evenly, so every partition should end up roughly the same size. However, the example above contradicts that.
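For reference, here is a small diagnostic sketch I can run to see how many rows actually land in each partition after the shuffle (it assumes the same df and partitionCount as above, and uses Spark's built-in spark_partition_id function):

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Count rows per physical partition after repartitioning;
// empty partitions will simply be missing from the output.
val perPartitionCounts = df
  .repartition(partitionCount)
  .groupBy(spark_partition_id().as("partition"))
  .count()

perPartitionCounts.orderBy("partition").show(partitionCount, truncate = false)
```

If some partition ids show very small or zero counts, that would match the tiny Parquet files I am seeing on disk.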

Could someone please help me here?

Avishek Bhattacharya
