Spark v2.4
spark.range(5, numPartitions=1).rdd.keys().repartition(7).glom().collect()
# [[], [], [], [], [], [], [0, 1, 2, 3, 4]]
# 2 partitions initially
spark.range(5, numPartitions=2).rdd.keys().glom().collect()
# [[0, 1], [2, 3, 4]]
spark.range(5, numPartitions=2).rdd.keys().repartition(7).glom().collect()
# [[], [], [2, 3, 4], [], [], [], [0, 1]]
It seems to only add empty partitions in this case, without splitting up the existing ones?
However, the DataFrame repartition does work as expected, using RoundRobinPartitioning.
spark.range(5, numPartitions=1).repartition(7).rdd.keys().glom().collect()
# [[0], [2], [4], [1], [], [], [3]]
spark.range(5, numPartitions=2).rdd.keys().glom().collect()
# [[0, 1], [2, 3, 4]]
spark.range(5, numPartitions=2).repartition(7).rdd.keys().glom().collect()
# [[1, 4], [], [], [], [], [3], [0, 2]]
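The round-robin exchange is visible in the physical plan (shown abbreviated below; the exact plan text varies between Spark versions):

spark.range(5, numPartitions=1).repartition(7).explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(7)
# +- *(1) Range (0, 5, step=1, splits=1)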
What partitioning method does RDD repartition use?
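For context, PySpark's RDD.repartition is just coalesce with the shuffle flag enabled, so the two calls below take the same code path (a sketch based on the Spark 2.4 source):

rdd = spark.range(5, numPartitions=1).rdd.keys()
# RDD.repartition(n) is implemented as coalesce(n, shuffle=True),
# so both calls produce the same kind of distribution:
rdd.repartition(7).glom().collect()
rdd.coalesce(7, shuffle=True).glom().collect()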
Update
@user10938362 led me to a similar question where I found a reply that answers my question.
Answered by @user11400142
That happens because Spark doesn't shuffle individual elements but rather blocks of data, with a minimum batch size of 10.
So if you have fewer elements than that per partition, Spark won't split up the contents of a partition.
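The blocks come from PySpark's coalesce, which re-serializes the RDD with a small batch size before handing it to the JVM (excerpt paraphrased from Spark 2.4's rdd.py, so self here is the RDD being repartitioned):

# Inside RDD.coalesce(numPartitions, shuffle=True): the data is pickled
# in batches of (at most) 10 elements, and the Scala-side shuffle then
# distributes those batches, not individual rows.
batchSize = min(10, self.ctx._batchSize or 1024)
ser = BatchedSerializer(PickleSerializer(), batchSize)
selfCopy = self._reserialize(ser)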
Examples
spark.range(10, numPartitions=1).rdd.keys().repartition(3).glom().collect()
# [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], []]
spark.range(11, numPartitions=1).rdd.keys().repartition(3).glom().collect()
# [[], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10]]