Spark repartition does not distribute records evenly

Question

I have an rdd which I re-partition by one field

   rdd = rdd.repartition( new Column("block_id"));

and save it to hdfs.

I would expect that if there are 20 different block_id's, the repartitioning would produce 20 new partitions each holding a different block_id. But in fact after repartitioning there are 19 partitions, each holding exactly one block_id and one partition holding two block_id's. This means that the core writing the partition with the two block_id's to disk takes twice the time compared to the other cores and therefore doubling the overall time.

I'm confused, the method `repartition` on `RDD` only takes an `Int`, not a `Column` — Raphael Roth, Jul 30 '17 at 17:30

score 1 · Answer 1 · answered Jul 30 '17 at 16:06

Spark Dataset uses hash partitioning. There is no guarantee that there will be no hash colisions so you cannot expect:

that if there are 20 different block_id's, the repartitioning would produce 20 new partitions each holding a different block_id

You can try to increase number of partitions but it using number which offers good guarantees is rather impractical.

With RDDs you can design your own partitioner How to Define Custom partitioner for Spark RDDs of equally sized partition where each partition has equal number of elements?

Spark repartition does not distribute records evenly

1 Answers1