Spark transformation on last partitions extremely slow

Question

I am running an iterative algorithm in which, during each iteration, the values in a list are each assigned a set of keys (1 to N). Over time, the distribution of files over keys becomes skewed. I noticed that after a few iterations, during the coalesce phase, processing becomes very slow on the last few partitions of my RDD.

My transformation is as follows:

dataRDD_of_20000_partitions.aggregateByKey(zeroOp)(seqOp, mergeOp)
    .mapValues(...)
    .coalesce(1000, shuffle = true)
    .collect()

Here, aggregateByKey aggregates on the keys I assigned earlier (1 to N). I am coalescing the partitions because I know how many partitions I need, and I set the shuffle flag of coalesce to true in order to balance out the partitions.

Could anyone point to some reasons why these transformations might cause the last few partitions of the RDD to process slowly? I am wondering whether part of this has to do with data skew.

score 2 · Answer 1 · edited May 23 '17 at 12:13

I have some observations.

You should have the right number of partitions to avoid data skew. I suspect that you have fewer partitions than you need. Have a look at this blog.
The collect() call fetches the entire RDD into the single driver node. It may sometimes cause an OutOfMemoryError.
Transformations like aggregateByKey() may cause performance issues due to shuffling.
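
To check whether skew is actually the cause, it can help to measure it before changing partition counts. The following is a minimal sketch, not the asker's code: the object name SkewCheck, the helpers partitionSizes and skewRatio, the sample data, and the local[2] master are all assumptions made for illustration. It lists the heaviest keys, reports how uneven the partitions are after the shuffle, and uses take() in place of collect() to look at a bounded sample.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SkewCheck {
  // Number of records in each partition, as (partitionIndex, count) pairs.
  def partitionSizes[T](rdd: RDD[T]): Array[(Int, Long)] =
    rdd.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size.toLong)) }.collect()

  // Largest partition divided by the mean partition size.
  // Values far above 1.0 suggest that a few partitions carry most of the data.
  def skewRatio(sizes: Seq[Long]): Double = {
    require(sizes.nonEmpty, "no partitions")
    val mean = sizes.sum.toDouble / sizes.size
    if (mean == 0.0) 1.0 else sizes.max / mean
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used only to make the sketch runnable.
    val sc = new SparkContext(new SparkConf().setAppName("skew-check").setMaster("local[2]"))
    try {
      // Hypothetical input: key 1 receives 9000 values, keys 2..1001 one value each.
      val data: RDD[(Int, Long)] =
        sc.parallelize(Seq.fill(9000)((1, 1L)) ++ (2 to 1001).map(k => (k, 1L)), numSlices = 20)

      // Heaviest keys by record count; only five pairs are sent to the driver.
      val heaviest = data
        .mapValues(_ => 1L)
        .reduceByKey(_ + _)
        .top(5)(Ordering.by[(Int, Long), Long](_._2))
      heaviest.foreach { case (k, n) => println(s"key=$k records=$n") }

      // Same shape as the pipeline in the question, with a trivial sum as the aggregation.
      val aggregated = data.aggregateByKey(0L)(_ + _, _ + _).coalesce(10, shuffle = true)
      val sizes = partitionSizes(aggregated).map(_._2).toSeq
      println(f"partitions=${sizes.size} skewRatio=${skewRatio(sizes)}%.2f")

      // Inspect a bounded sample instead of pulling the whole RDD into the driver.
      aggregated.take(5).foreach(println)
    } finally sc.stop()
  }
}
```

Record counts are only a proxy here. If the aggregated value for a hot key grows with the number of inputs (a collection rather than a sum, for example), partitions with equal record counts can still differ widely in bytes, so the stage's task metrics in the Spark UI are worth checking as well.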

Have a look at this SE question for more details: Spark : Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()
