I have an rdd which I re-partition by one field
rdd = rdd.repartition( new Column("block_id"));
and save it to hdfs.
I would expect that if there are 20 different block_id
's, the repartitioning would produce 20 new partitions each holding a different block_id
.
But in fact after repartitioning there are 19 partitions, each holding exactly one block_id
and one partition holding two block_id
's.
This means that the core writing the partition with the two block_id
's to disk takes twice the time compared to the other cores and therefore doubling the overall time.