
Need to merge small Parquet files. I have multiple small Parquet files in HDFS and would like to combine them into files of nearly 128 MB each. So I read all the files using spark.read(), did a repartition() on the result, and wrote it back to the HDFS location.

My issue is that I have approximately 7.9 GB of data, but after I repartition and save it to HDFS it grows to nearly 22 GB.

I have tried repartition, repartitionByRange, and coalesce, but none of them solved the problem.
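For reference, a minimal sketch of the approach described above; the HDFS paths are placeholders, and deriving the partition count from the total input size (rather than hard-coding it) is an assumption about the intent:

```scala
// A minimal sketch of the approach described in the question.
// Paths are placeholders; the partition count is derived from the
// total input size so output files land near the 128 MB target.
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.getOrCreate()

val inputPath = "hdfs:///path/to/small/files" // placeholder
val outputPath = "hdfs:///path/to/merged"     // placeholder

// Total on-disk size of the input files, read from the filesystem.
val fs = new Path(inputPath)
  .getFileSystem(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength

// Aim for ~128 MB per output file. Output sizes are approximate,
// since Parquet compression changes the on-disk size.
val targetBytes = 128L * 1024 * 1024
val numPartitions = math.max(1, (totalBytes.toDouble / targetBytes).ceil.toInt)

spark.read.parquet(inputPath)
  .repartition(numPartitions)
  .write.mode(SaveMode.Overwrite)
  .parquet(outputPath)
```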

Nikunj Kakadiya
  • After repartitioning, the size of all of your files together is 22 GB, and previously (so without repartitioning) it was a smaller number, right? If yes, how big is the difference? – M_S Nov 24 '22 at 15:09
  • Please show code, that is always better. – thebluephantom Nov 24 '22 at 22:35
  • val df = spark.read.parquet("path/to/parquet/*.*"); df.repartition(10).write.mode(SaveMode.Overwrite).option("compression","snappy").parquet("/path/to/file") – pavan kumar Nov 30 '22 at 03:55

1 Answer


I think it may be connected with your repartition operation. You are using .repartition(10), so Spark is going to use round-robin partitioning to redistribute your data, which means the ordering of the rows will probably change. The order of the data is important for compression; you can read more in this question.

You may try to add a sort, or repartition your data by an expression instead of only a number of partitions, to optimize the file size, as in the sketch below.
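For example, a minimal sketch of both suggestions, based on the code from the comments; `someColumn` is a hypothetical column that your rows are correlated by:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("path/to/parquet/*.*")

// Option 1: keep .repartition(10), but sort rows inside each partition
// so similar values sit next to each other and encode/compress better.
df.repartition(10)
  .sortWithinPartitions(col("someColumn"))
  .write.mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet("/path/to/file")

// Option 2: repartition by an expression instead of only a number,
// which co-locates rows with the same key in the same output file.
df.repartition(10, col("someColumn"))
  .write.mode(SaveMode.Overwrite)
  .option("compression", "snappy")
  .parquet("/path/to/file")
```

Note that repartitioning by a low-cardinality expression can produce skewed file sizes, so adding sortWithinPartitions to the existing round-robin repartition is usually the safer first try.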

M_S