It seems one of my assumptions regarding order in RDDs was incorrect (related).
Suppose I wish to repartition an RDD after having sorted it.
import random

# `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell)
l = list(range(20))
random.shuffle(l)

(spark.sparkContext
    .parallelize(l)
    .sortBy(lambda x: x)
    .repartition(3)
    .collect())
Which yields:
[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
As we can see, the order is preserved within each partition, but the total order across partitions is not.
I would like to preserve total order of the RDD, like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
I am having difficulty finding anything online which could be of assistance. Help would be appreciated.