I am using Spark's repartition to change the number of partitions in a DataFrame. When writing the data out after repartitioning, I noticed that Parquet files of very different sizes are created.
Here is the code I am using to repartition:

import org.apache.spark.sql.SaveMode

df.repartition(partitionCount).write.mode(SaveMode.Overwrite).parquet("/test")
Most of the output files are only a few KB in size, while some are around 100 MB, which is the size I want to keep per partition. Here is a sample listing:
20.2 K /test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
20.2 K /test/part-00011-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
99.9 M /test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet
Now, if I read one of the 20.2 K Parquet files and run a count action, the result is 0. For the 99.9 M files, the same count returns a non-zero result.
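For reference, this is roughly how I am checking the counts (a sketch; spark is the SparkSession, and the paths are files from the listing above):

// Count the rows in one of the small files -- returns 0
val smallDf = spark.read.parquet("/test/part-00010-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet")
println(smallDf.count())

// The same check against one of the ~100 MB files returns a non-zero count
val bigDf = spark.read.parquet("/test/part-00012-0957f5aa-1f14-4295-abe2-0aacfe135444.snappy.parquet")
println(bigDf.count())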
As per my understanding of repartition on a DataFrame, it does a full shuffle and tries to make all partitions roughly the same size. However, the example above seems to contradict that.
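One diagnostic that should make the skew visible before the write is to count the rows in each partition of the repartitioned DataFrame. A minimal sketch of that check (assuming df and partitionCount are defined as above):

// Print the row count of every partition after repartitioning
val partitionSizes = df.repartition(partitionCount)
  .rdd
  .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
partitionSizes.foreach { case (idx, n) => println(s"partition $idx: $n rows") }

If repartition were balancing the rows, every partition should report a similar count here.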
Could someone please help me understand what is going on here?