I have a large set of end-result data, and it is distributed non-uniformly across the column I am interested in partitioning by. When I write it out directly with partitionBy, each partition directory gets as many files as spark.sql.shuffle.partitions. This makes every file in a crowded partition very large (in the GBs), while in some other partitions the files are really small (even in the KBs). Is there a way to change the number of files per partition?
Example:
+----------------------------+----------+
| number of rows in category | category |
+----------------------------+----------+
|                50000000000 | A        |
|                     200000 | B        |
|                      30000 | C        |
+----------------------------+----------+
If I do:
df.write.partitionBy("category").parquet(output_dir)
The files in folder "A" are large, whereas the ones in "B" and "C" are small.
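For completeness, here is a small self-contained sketch of what I am doing. The synthetic skewed DataFrame, the groupBy (standing in for my real job), and the /tmp/output_dir path are just placeholders for my actual pipeline:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skewed-partitioned-write").getOrCreate()

# Stand-in for my real data: heavily skewed towards category A.
raw = spark.range(0, 10_000_000).withColumn(
    "category",
    F.when(F.col("id") % 100 < 98, F.lit("A"))   # ~98% of rows
     .when(F.col("id") % 100 == 98, F.lit("B"))  # ~1% of rows
     .otherwise(F.lit("C")),                     # ~1% of rows
)

# My real DataFrame is the end result of a shuffled computation, so it sits in
# spark.sql.shuffle.partitions partitions before the write (the groupBy here
# stands in for that).
result = raw.groupBy("category", "id").agg(F.count("*").alias("cnt"))

# Each write task emits one file per category it holds, so every
# category=<value> directory ends up with (up to) spark.sql.shuffle.partitions
# files: huge files under category=A, tiny ones under category=B and category=C.
result.write.partitionBy("category").parquet("/tmp/output_dir")  # placeholder path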