I have read that too many small partitions hurt performance because of overhead, e.g. sending a very large number of tasks to executors.
What are the downsides of using maximally large partitions, e.g. why do I see recommendations in the 100s of MB range?
I can see a few potential issues:
- If you lose a partition, there's a large amount of work to recompute. With many smaller partitions you may lose one more often, but each loss is cheaper to recompute, so you will have less variance in your runtime.
- If one of your few tasks on large partitions takes longer to compute than the others, the other cores sit idle while it finishes. With smaller partitions, the scheduler can distribute the remaining work more evenly across the cluster.
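
To make the second point concrete, here is a rough toy model in plain Python (not Spark code). It assumes a greedy scheduler that hands the next task to whichever core frees up first, and that exactly one partition takes `SKEW` times as long as the others. The function names and all the numbers are made up for illustration.

```python
import heapq

def makespan(task_durations, num_cores):
    """Total wall-clock time when each task goes to the core that frees up first."""
    cores = [0.0] * num_cores  # time at which each core becomes free
    heapq.heapify(cores)
    for duration in task_durations:
        free_at = heapq.heappop(cores)
        heapq.heappush(cores, free_at + duration)
    return max(cores)

def split_work(total_work, num_partitions, skew):
    """Equal-sized tasks, except the first one is `skew` times slower (assumed skew model)."""
    base = total_work / num_partitions
    durations = [base] * num_partitions
    durations[0] = base * skew
    return durations

NUM_CORES = 4       # hypothetical cluster size
TOTAL_WORK = 400.0  # arbitrary time units
SKEW = 2.0          # the one slow task takes twice as long

few_large = makespan(split_work(TOTAL_WORK, 4, SKEW), NUM_CORES)
many_small = makespan(split_work(TOTAL_WORK, 40, SKEW), NUM_CORES)

print("4 large partitions: ", few_large)
print("40 small partitions:", many_small)
```

With these numbers the 4-partition run takes 200 time units, because three cores wait on the slow task, while the 40-partition run takes 110. The model ignores per-task scheduling overhead, which is exactly the cost that pushes in the other direction when partitions get too small.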
Do these issues make sense, and are there others? Thanks!