> so that lots of threads can work on a single partition concurrently
A partition is the smallest unit of concurrency in Spark: each partition is processed by a single thread. You can of course use parallel processing inside `mapPartitions` (see the sketch below), but it is not part of standard Spark logic.
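For example, a minimal sketch of opting into extra threads inside a partition with Scala parallel collections (works out of the box on Scala 2.12; Scala 2.13 needs the separate scala-parallel-collections module). `expensiveOp` is a hypothetical CPU-heavy workload:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")
  .master("local[4]")
  .getOrCreate()
val sc = spark.sparkContext

// Stand-in for a CPU-heavy per-record computation (hypothetical).
def expensiveOp(x: Int): Int = { Thread.sleep(1); x * 2 }

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// Spark runs exactly one thread per partition; the .par below adds our
// own thread pool inside each partition, outside standard Spark logic.
val result = rdd.mapPartitions { iter =>
  iter.toVector.par.map(expensiveOp).seq.iterator
}

println(result.sum())
```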
Higher parallelism means more partitions, when the number of partitions is not specified otherwise. Usually that is a desired outcome, but it comes with a price: growing bookkeeping costs, less efficient aggregations, and, generally speaking, less data that can be processed locally without serialization/deserialization and network traffic. It can become a serious problem when the number of partitions is very high compared to the amount of data and the number of available cores (see Spark iteration time increasing exponentially when using join).
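To make the aggregation point concrete, here is a sketch (reusing `spark` and `sc` from the snippet above). With map-side combining, `reduceByKey` ships at most one record per key per partition into the shuffle, so the shuffled volume grows with the partition count even though the final result is identical:

```scala
// 10 distinct keys spread over 800 partitions.
val pairs = sc.parallelize((1 to 100000).map(i => (i % 10, 1)), numSlices = 800)

// Each of the 800 partitions emits up to one record per key into the
// shuffle: roughly 800 * 10 shuffle records for just 10 final sums...
val many = pairs.reduceByKey(_ + _)

// ...while coalescing to 8 partitions first cuts that to roughly 8 * 10.
val few = pairs.coalesce(8).reduceByKey(_ + _)
```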
When it makes sense to increase parallelism:
- you have a large amount of data and a lot of spare resources (a recommended number of partitions is twice the number of available cores; see the sketch after this list).
- you want to reduce the amount of memory required to process a single partition.
- you perform computationally intensive tasks.
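A sketch of that rule of thumb (`spark` and `sc` as above; the parquet path is a hypothetical placeholder):

```scala
// defaultParallelism typically equals the total number of executor cores
// (or local threads in local mode).
val target = sc.defaultParallelism * 2

// Repartitioning up also shrinks each partition, which reduces the
// memory needed to process any single one of them.
val df = spark.read.parquet("/path/to/input")  // hypothetical path
  .repartition(target)
```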
When it doesn't make sense to increase parallelism:
- parallelism >> number of available cores.
- parallelism is high compared to the amount of data and you want to process more than one record at a time (`groupBy`, `reduce`, `agg`); see the sketch after this list.
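For instance, a sketch of the second case, assuming `df` is a DataFrame with a `key` column that is over-partitioned relative to its size (`sc` as above):

```scala
import org.apache.spark.sql.functions.count

// With only a few records per partition, map-side partial aggregation
// barely reduces anything; coalescing first puts more records per key
// back into the same partition before the shuffle.
val aggregated = df
  .coalesce(sc.defaultParallelism)
  .groupBy("key")
  .agg(count("*").as("n"))
```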
Generally speaking, I believe that `spark.default.parallelism` is not a very useful tool and it makes more sense to adjust parallelism on a case-by-case basis. If parallelism is too high, it can result in empty partitions in the case of data loading and simple transformations, and in reduced performance / suboptimal resource usage. If it is too low, it can lead to problems when you perform transformations which may require a large number of partitions (joins, unions).
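Both failure modes are easy to reproduce (`spark` and `sc` as above; `bigA` / `bigB` are assumed DataFrames with an `id` column):

```scala
// Too high: 100 records across 1000 slices leaves roughly 900 empty
// partitions, each still paying task-scheduling overhead.
val tiny = sc.parallelize(1 to 100, numSlices = 1000)
val sizes = tiny.mapPartitionsWithIndex((i, it) => Iterator(it.size)).collect()
println(sizes.count(_ == 0))

// Case by case: raise the shuffle partition count only for the wide
// transformation that needs it, instead of relying on a global default.
spark.conf.set("spark.sql.shuffle.partitions", "400")
val joined = bigA.join(bigB, Seq("id"))
```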