
I'm working through these two concepts right now and would like some clarity. Experimenting on the command line, I've been trying to identify the differences between them and when a developer would use repartition vs partitionBy.

Here is some sample code:

rdd = sc.parallelize([('a', 1), ('a', 2), ('b', 1), ('b', 3), ('c', 1), ('ef', 5)])
rdd1 = rdd.repartition(4)
rdd2 = rdd.partitionBy(4)

rdd1.glom().collect()
[[('b', 1), ('ef', 5)], [], [], [('a', 1), ('a', 2), ('b', 3), ('c', 1)]]

rdd2.glom().collect()
[[('a', 1), ('a', 2)], [], [('c', 1)], [('b', 1), ('b', 3), ('ef', 5)]]

I took a look at the implementation of both, and the only difference I've noticed for the most part is that partitionBy can take a partitioning function, or it uses portable_hash by default. So with partitionBy, all records with the same key should end up in the same partition. With repartition, I would expect the values to be distributed more evenly over the partitions, but this isn't the case.
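To make that concrete, here is a rough sketch of what I mean by passing a partitioning function (first_letter_partitioner is just a name I made up for illustration):

# Illustrative only: partitionBy accepts a partitionFunc in place of the
# default portable_hash, giving control over which partition each key goes to.
def first_letter_partitioner(key):
    # map each key to a number based on its first character;
    # Spark takes this modulo the number of partitions
    return ord(key[0])

rdd3 = rdd.partitionBy(4, first_letter_partitioner)
rdd3.glom().collect()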

Given this, why would anyone ever use repartition? I suppose the only time I could see it being used is if I'm not working with a PairRDD, or if I have large data skew?

Is there something that I'm missing, or could someone shed light from a different angle for me?

Joe Widen

2 Answers


repartition() is used to set the number of partitions, taking into account the number of cores and the amount of data you have.

partitionBy() is used to make subsequent shuffle operations such as reduceByKey(), join(), cogroup(), etc. more efficient. It is only beneficial in cases where an RDD is used multiple times, so it is usually followed by persist().
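A minimal sketch of that pattern (the data and variable names are just placeholders, assuming a SparkContext sc is available):

users = sc.parallelize([(1, 'alice'), (2, 'bob'), (3, 'carol')])
events = sc.parallelize([(1, 'click'), (1, 'view'), (2, 'view')])

# Shuffle users by key once and keep the partitioned layout in memory.
users_by_key = users.partitionBy(3).persist()

# Subsequent key-based operations such as join() can reuse that partitioning,
# so the users_by_key side is not shuffled again.
users_by_key.join(events).collect()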

Differences between the two in action:

pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1, 5, 6, 7, 7, 5, 5, 6, 4]).map(lambda x: (x, x))

pairs.partitionBy(3).glom().collect()
[[(3, 3), (6, 6), (6, 6)],
 [(1, 1), (4, 4), (4, 4), (1, 1), (7, 7), (7, 7), (4, 4)],
 [(2, 2), (2, 2), (5, 5), (5, 5), (5, 5)]]

pairs.repartition(3).glom().collect()
[[(4, 4), (2, 2), (6, 6), (7, 7), (5, 5), (5, 5)],
 [(1, 1), (4, 4), (6, 6), (4, 4)],
 [(2, 2), (3, 3), (1, 1), (5, 5), (7, 7)]]
Hui Guo

repartition already exists on RDDs and does not handle partitioning by key (or by any other criterion except Ordering). PairRDDs add the notion of keys and, with it, another method, partitionBy, that allows partitioning by that key.

So yes, if your data is keyed, you should absolutely partition by that key, which in many cases is the point of using a PairRDD in the first place (for joins, reduceByKey, and so on).
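A small sketch of the contrast (purely illustrative, assuming a SparkContext sc):

nums = sc.parallelize(range(12))

# repartition works on any RDD; it only changes the number of partitions.
nums.repartition(4).glom().collect()

# partitionBy requires (key, value) pairs, because it partitions by key.
pairs = nums.map(lambda x: (x % 3, x))
pairs.partitionBy(3).glom().collect()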

Marius Soutier
  • What is the reason that repartition doesn't distribute the elements evenly-ish across the partitions? Could this be a case where I don't have enough data and we're seeing a small-sample-size issue? – Joe Widen Nov 20 '15 at 18:54
  • Good question, I'm seeing an even distribution when trying it out (in Scala). – Marius Soutier Nov 20 '15 at 19:36
  • @JoeWiden Nothing more than simple probability. `repartition` actually uses a pair RDD internally, adding a random key to the existing values, so it doesn't provide strong guarantees about the output data distribution. BTW you should probably accept the answer. – zero323 May 07 '16 at 10:28
  • @MariusSoutier Actually __any__ repartitioning in Spark is handled using pair RDDs. If needed, Spark just adds dummy keys or dummy values to make it work. – zero323 May 07 '16 at 10:30