
I generated an evenly distributed Parquet dataset to evaluate what spark.sql.files.maxPartitionBytes does. I used a cluster with 16 cores. The dataset looks like this (a rough generation sketch follows the list):

  • Total size: 2483.9 MB
  • Number of files in the Parquet dataset: 16
  • Min file size: 155.2 MB
  • Max file size: 155.3 MB
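
An evenly distributed dataset like this can be generated roughly as follows (just a sketch, not my exact code; the row count and the payload column are placeholders chosen to land near 2.5 GB):

    # Sketch only: write roughly 2.5 GB of evenly distributed data into 16 Parquet files.
    # Row count and payload column are placeholders, not the exact values used.
    from pyspark.sql.functions import rand

    df = spark.range(0, 300_000_000).withColumn("payload", rand())
    df.repartition(16).write.mode("overwrite").parquet(par_path)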

My understanding so far was that maxPartitionBytes limits the size of a Spark partition. But I have realized that in some scenarios I get bigger Spark partitions than I wanted. Why is that?

I looked at the SO answers to "Skewed partitions when setting spark.sql.files.maxPartitionBytes" and "What is openCostInBytes?".

Next, I ran two experiments.

Experiment 1. 90m maxPartitionBytes:

  • 32 Spark partitions are read
  • In the Spark UI we can see that the data from each file has been split into 2 partitions of 29 MB and 127 MB
  • Now my questions are:
    • Why are the partitions larger than the 90 MB defined in maxPartitionBytes? (see the calculation sketched below the screenshot)
    • Why does Spark not pack the 29 MB partitions together?

    # Read with maxPartitionBytes = 90m and force a full scan via a noop write
    spark.conf.set("spark.sql.files.maxPartitionBytes", "90m")
    sdf = spark.read.format("parquet").load(par_path)
    print("Number partitions: " + str(sdf.rdd.getNumPartitions()))
    sdf.write.format("noop").mode("overwrite").save()

(Spark UI screenshot: task sizes for Experiment 1)
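
My 90 MB expectation comes from the split-size calculation I have seen described for Spark's file sources (FilePartition.maxSplitBytes). A back-of-the-envelope version with my numbers, assuming the default 4 MB openCostInBytes and a default parallelism of 16:

    # Back-of-the-envelope split-size calculation as I understand it
    # (based on descriptions of FilePartition.maxSplitBytes; all values in MB).
    max_partition_bytes = 90.0   # spark.sql.files.maxPartitionBytes = 90m
    open_cost = 4.0              # spark.sql.files.openCostInBytes default (assumed)
    default_parallelism = 16     # 16 cores (assumed)

    total_size = 2483.9 + 16 * open_cost               # data size plus one open cost per file
    bytes_per_core = total_size / default_parallelism  # ~159 MB
    max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))
    print(max_split_bytes)       # 90.0 -> I expected splits of at most ~90 MB per file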

Experiment 2. 120m maxPartitionBytes:

  • The same problem as before: partitions are bigger than 120 MB
  • But now the smaller partitions are suddenly packed together, giving 22 partitions in total
  • Questions:
    • Why does the number of partitions behave so differently here?
    • Why do we still see a deviation from the configured partition size?

    # Same read as before, now with maxPartitionBytes = 120m
    spark.conf.set("spark.sql.files.maxPartitionBytes", "120m")
    sdf = spark.read.format("parquet").load(par_path)
    print("Number partitions: " + str(sdf.rdd.getNumPartitions()))
    sdf.write.format("noop").mode("overwrite").save()

(Spark UI screenshot: task sizes for Experiment 2)
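
As a side note, the row distribution across the resulting partitions can be inspected like this (a quick check; row counts are only a rough proxy for the byte sizes shown in the Spark UI):

    # Quick check: rows per Spark partition (a proxy for partition size, not exact bytes)
    from pyspark.sql.functions import spark_partition_id

    sdf.groupBy(spark_partition_id().alias("partition_id")) \
       .count() \
       .orderBy("partition_id") \
       .show(40)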

  • My _guess_ is that using the Parquet format would not be a clean test because a) those files might use compression (snappy?) internally; b) depending on the row group size in the Parquet files, Spark might not be able to split them into smaller partitions. Perhaps you can try your experiment with a plain .csv and see if the results hold? Also, check out this https://community.databricks.com/t5/data-engineering/solved-maxpartitionbytes-ignored/m-p/12664 – mazaneicha Jun 30 '23 at 20:30
  • Thanks for the comment. :) I checked the link, and one of the replies suggests that snappy-compressed Parquet files are splittable. I think Databricks handles this automatically in my case. I am sure the files are being split, because on storage I have 16 files while 32 or 22 non-empty partitions are generated. – Nikolaos Jul 02 '23 at 05:26
  • One thing that might be the answer to the first question: I assumed the deviation might also come from the Parquet size. But if I sum up the task sizes, they add up to the same size as on storage, so I assume maxPartitionBytes relates to the file size. – Nikolaos Jul 02 '23 at 05:39

0 Answers