By default, the Hadoop MapReduce shuffle sorts keys within each partition but not across partitions (it is total ordering that makes keys sorted across partitions).
How can I achieve the same thing with a Spark RDD, that is, sort within each partition but not across partitions?
- RDD's `sortByKey` method performs a total ordering.
- RDD's `repartitionAndSortWithinPartitions` sorts within each partition but not across partitions; unfortunately, it adds an extra repartition step.
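
To make the distinction concrete, here is a minimal plain-Python sketch (no Spark involved) that models partitions as lists of key-value pairs. The function names `sort_within_partitions` and `total_order` are hypothetical, and the even split in `total_order` is a simplifying assumption rather than how Spark actually chooses range boundaries.

```python
# Plain-Python model of the two behaviours; each inner list stands for one partition.

def sort_within_partitions(partitions):
    """Sort each partition independently; records never move between partitions."""
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


def total_order(partitions):
    """Globally sort, then split into ranges so every key in partition i
    is <= every key in partition i + 1 (even split is an assumption)."""
    records = sorted((kv for p in partitions for kv in p), key=lambda kv: kv[0])
    n = len(partitions)
    size = -(-len(records) // n)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(n)]


partitions = [[(5, "e"), (1, "a"), (9, "i")], [(4, "d"), (8, "h"), (2, "b")]]
local = sort_within_partitions(partitions)   # what I want: no data movement
globally = total_order(partitions)           # what a total ordering produces
```

In `local`, each partition is sorted but the key ranges of the two partitions still overlap; in `globally`, records have moved between partitions so that the ranges no longer overlap.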
Is there a direct way to sort within each partition, without sorting across partitions and without the extra repartition?