I have gone through various articles about hash partitioning. But I still don't get it in what scenarios it is more advantageous than range partitioning. Using sortByKey followed by range partitioning allows data to be distributed evenly across cluster. But that may not be the case in hash partitioning. Consider the following example:
Consider a pair RDD with keys [8, 96, 240, 400, 401, 800] and the desired number of partition is 4.
In this case, hash partitioning distributes the keys as follows among the partitions:
partition 0: [8, 96, 240, 400, 800]
partition 1: [ 401 ]
partition 2: []
partition 3: []
(To compute partition : p = key.hashCode() % numPartitions )
The above partition leads to bad performance as the keys are not evenly distributed across all nodes. Since range partition can equally distribute the keys across the cluster, then in what scenarios hash partition proves to be a best fit over range partition?