The result of our reduceByKey operation is an RDD that is quite skewed, with most of the data in one or two partitions. To increase the parallelism of the processing after the reduceByKey, we do a repartition, which forces a shuffle:
rdd.reduceByKey(_+_).repartition(64)
I know that it is possible to pass a Partitioner into the reduceByKey operation. But (other than writing a custom one) I think the options are the HashPartitioner and the RangePartitioner. And I think both of these will still leave the data skewed after partitioning, since the keys are fairly unique.
Is it possible to shuffle the output RDD of a reduceByKey evenly, without the additional repartition call?