
I have read that too many small partitions hurt performance because of overhead, e.g. sending a very large number of tasks to executors.

What are the downsides of using maximally large partitions, e.g. why do I see recommendations in the 100s-of-MB range?

I can see a few potential issues:

  • If you lose a partition, there's a large amount of work to recompute. With many smaller partitions you may lose partitions more often, but you will have less variance in your runtime.
  • If one of your few tasks on large partitions takes longer to compute than the others, it leaves other cores under-utilized; with smaller partitions the work can be distributed more evenly across the cluster.

Do these issues make sense, and are there others? Thanks!

allstar

2 Answers


These two potential issues are correct.

For better cluster usage, one should define partitions large enough to fill an HDFS block (128/256 MB in general) but avoid exceeding it, so that work stays well distributed and can scale horizontally for performance (maximizing CPU usage).
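As a rough illustration of that sizing rule, here's a minimal sketch (Scala/Spark) that derives a partition count from the input size and a per-partition target; the 10 GB input size, the 128 MB target, and the HDFS path are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

    val inputBytes  = 10L * 1024 * 1024 * 1024 // assumed total input size: 10 GB
    val targetBytes = 128L * 1024 * 1024       // one HDFS block: 128 MB

    // Aim for partitions that each hold roughly one block's worth of data.
    val numPartitions = math.ceil(inputBytes.toDouble / targetBytes).toInt // 80 here

    val df = spark.read.parquet("hdfs:///data/events") // hypothetical path
    val sized = df.repartition(numPartitions)
    println(s"partitions: ${sized.rdd.getNumPartitions}")

    spark.stop()
  }
}
```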

baitmbarek
  • As for the first point, you cannot assume that the variance in runtime will be lower just because you have a larger number of smaller partitions. Say one of the nodes crashes: that results in recomputation of the lost RDD partitions, but now you have one fewer node to process the data, so your runtime will increase irrespective of the number of partitions.
  • Regarding "If one of your few tasks on large partitions takes longer to compute than the others": this happens when you have skewed data. Increasing the number of partitions can help, but simply increasing the count isn't always sufficient, since all rows with the same hot key still hash to the same partition; see the sketch below.
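To make that second point concrete, one technique that goes beyond just raising the partition count is key salting. A minimal sketch, with hypothetical column names (userId, value) and a toy dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("salting").getOrCreate()
import spark.implicits._

// Toy skewed dataset: "hot" occurs far more often than the other keys.
val df = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4))
  .toDF("userId", "value")

// Append a random salt so rows for a single hot key spread over several partitions.
val saltBuckets = 8
val salted = df.withColumn("salt", (rand() * saltBuckets).cast("int"))

// Aggregate in two stages: first per (key, salt), then per key.
val partial = salted.groupBy($"userId", $"salt").agg(sum($"value").as("partialSum"))
val result  = partial.groupBy($"userId").agg(sum($"partialSum").as("total"))
result.show()
```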

The max partition size should not be greater than 128 MB, which is the default block size in HDFS. But you should also not have very small partitions, as they add the overhead of scheduling many tasks and of maintaining a lot of task metadata. As with any multithreaded application, increasing parallelism doesn't always increase performance, and in the end it comes down to finding the optimal value at which you get maximum performance.
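For tuning, Spark exposes configuration knobs for both sides of this trade-off. A minimal sketch; the values shown are the defaults, given as illustrative starting points rather than recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("partition-tuning")
  // Upper bound on bytes packed into one partition when reading files
  // (default 128 MB, matching the common HDFS block size).
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  // Number of partitions produced by shuffles (joins, groupBy, ...); default 200.
  .config("spark.sql.shuffle.partitions", 200)
  .getOrCreate()
```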

By having a large partition size you will have:

  • Less concurrency,
  • Increased memory pressure for transformations that involve a shuffle,
  • More susceptibility to data skew.

Please refer here to find the optimal number of partitions.
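One practical way to gauge whether your current partitioning is near that optimum is to count records per partition; a minimal diagnostic sketch, assuming you already have a DataFrame `df`:

```scala
import org.apache.spark.sql.functions.{col, spark_partition_id}

// Tag each row with the id of the partition it lives in, then count per partition.
val perPartition = df
  .withColumn("pid", spark_partition_id())
  .groupBy("pid")
  .count()
  .orderBy(col("count").desc)

// A few partitions with outsized counts indicate skew; uniformly tiny
// counts suggest too many partitions for the data volume.
perPartition.show()
```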

wypul