I am running an iterative algorithm in which during each iteration, a list of values are each assigned a set of keys (1 to N). Over time, the distribution of files over keys become skewed. I noticed that after a few iterations, coalesce phase, things seem to start running really slow on the last few partitions of my RDD.
My transformation is as follows:
dataRDD_of_20000_partitions.aggregateByKey(zeroOp)(seqOp, mergeOp)
.mapValues(...)
.coalesce(1000, true)
.collect()
Here, aggregatebykey aggregates upon the keys I assigned earlier (1 to N). I can coalescing partitions because I know the number of partitions I need, and set coalesce shuffle to true in order to balance out the partitions.
Could anyone point to some reasons that these transformations may cause the last few partitions of the RDD to process slow? I am wondering if part of this has to do with data skewness.