RDD: Preserve total order when repartitioning

Question

It seems one of my assumptions about ordering in RDDs was incorrect (related).

Suppose I want to repartition an RDD after sorting it.

import random

l = list(range(20))
random.shuffle(l)

# Assumes an existing SparkSession named `spark`.
spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x)\
    .repartition(3)\
    .collect()

Which yields:

[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

As we can see, order is preserved within each partition, but the total order across partitions is lost.

I would like to preserve total order of the RDD, like so:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

I am having difficulty finding anything online which could be of assistance. Help would be appreciated.

@Kishore, I work with billions of rows, so unfortunately this won't work. — icarus, Jul 02 '18 at 11:53
@shaido, it certainly will be. Does it preserve the partitions? — icarus, Jul 03 '18 at 07:01

icarus · Accepted Answer · 2018-07-03T07:58:58.710

It appears that passing numPartitions=partitions to sortBy sets the number of partitions as part of the sort itself, so the RDD is partitioned and the total order is preserved:

import random

l = list(range(20))
random.shuffle(l)

partitions = 3

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .collect()
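To illustrate why this keeps the total order, here is a minimal pure-Python sketch of range partitioning, the general idea behind a partitioned sort. It is a toy model, not Spark's implementation: the function name range_partition_sort is hypothetical, and the boundaries are taken from the fully sorted data for simplicity rather than estimated from the input. Each partition holds a contiguous range of values, so reading the partitions in index order gives a globally sorted result.

```python
import bisect
import random


def range_partition_sort(values, num_partitions):
    """Toy model of a range-partitioned sort (hypothetical helper, not Spark code)."""
    if not values:
        return [[] for _ in range(num_partitions)]

    # Pick num_partitions - 1 boundary values. For simplicity this sketch
    # reads them off the fully sorted data instead of estimating them.
    ordered = sorted(values)
    bounds = [ordered[len(ordered) * i // num_partitions]
              for i in range(1, num_partitions)]

    # Route each value to the partition whose range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        partitions[bisect.bisect_right(bounds, v)].append(v)

    # Sort inside each partition; ranges do not overlap, so the
    # concatenation of the partitions is sorted as a whole.
    return [sorted(p) for p in partitions]


l = list(range(20))
random.shuffle(l)

parts = range_partition_sort(l, 3)
flattened = [x for part in parts for x in part]

print(parts)
print(flattened == sorted(l))
```

On a real RDD, calling .glom().collect() instead of .collect() returns one list per partition, which makes it easy to check that the partition count and the ordering are both what you expect.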

edited Jul 03 '18 at 07:58

answered Jul 03 '18 at 07:10

icarus

