To evaluate what maxPartitionBytes actually does, I generated a Parquet dataset whose files are all evenly sized (a rough sketch of the generation code is below the stats). I used a cluster with 16 cores.
- Total size: 2483.9 MB
- Number of files (one file per write partition): 16
- Min file size: 155.2 MB
- Max file size: 155.3 MB
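For reference, the test data was produced roughly like this (a sketch, not my exact code; the schema, row count, and par_path are placeholders, adjust them to hit the sizes above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
par_path = "/tmp/even_parquet"  # placeholder path

# 16 evenly sized output files: repartition() without columns uses
# round-robin partitioning, so each file gets roughly the same amount of data
df = spark.range(0, 200_000_000).withColumn("payload", F.rand())
df.repartition(16).write.mode("overwrite").parquet(par_path)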
My understanding so far was that maxPartitionBytes caps the size of a Spark partition when reading files. But I noticed that in some scenarios I get Spark partitions that are larger than that limit. Why does this happen?
I already looked at the SO answers to "Skewed partitions when setting spark.sql.files.maxPartitionBytes" and "What is openCostInBytes?".
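For reference, these are the settings involved and how I check them (as far as I know the defaults are 128 MB for maxPartitionBytes and 4 MB for openCostInBytes, but verify this on your own session):

# Effective values on the current session
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))
print(spark.conf.get("spark.sql.files.openCostInBytes"))
print(spark.sparkContext.defaultParallelism)  # 16 on my 16-core cluster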
Next I did two experiments.
Experiment 1. maxPartitionBytes = 90m:
- 32 Spark partitions are read
- In the Spark UI I can see that the data from each file has been split into 2 partitions of roughly 29 MB and 127 MB
- My questions are:
- Why is a partition larger than the 90 MB set in maxPartitionBytes?
- Why does Spark not pack the 29 MB partitions together?
spark.conf.set("spark.sql.files.maxPartitionBytes", "90m")
sdf = spark.read.format("parquet").load(par_path)
print("Number partitions: " + str(sdf.rdd.getNumPartitions()))
sdf.write.format("noop").mode("overwrite").save()
Experiment 2. maxPartitionBytes = 120m:
- The same problem as before: some partitions are bigger than the 120 MB limit
- But now the smaller chunks are packed together, so I end up with 22 partitions in total
- Questions:
- Why does the number of partitions behave so differently here?
- Why do the partition sizes still deviate from maxPartitionBytes?
spark.conf.set("spark.sql.files.maxPartitionBytes", "120m")
sdf = spark.read.format("parquet").load(par_path)
print("Number partitions: " + str(sdf.rdd.getNumPartitions()))
sdf.write.format("noop").mode("overwrite").save()