
I have two pair RDDs of type RDD[(String, Int)], called rdd1 and rdd2.

Each of these RDDs is grouped by its key, and I want to execute a function over its values (so I will use the mapValues method). Does groupByKey create a new partition for each key, or do I have to specify this manually using partitionBy?

I understand that the partitions of an RDD won't change if I don't perform operations that change the keys, so if I perform a mapValues operation on each RDD, or a join between the previous two RDDs, the partitions of the resulting RDD won't change. Is that true?

Here is a code example. Notice that function is not defined because it is not important here.

val lvl1rdd = rdd1.groupByKey()
val lvl2rdd = rdd2.groupByKey()
val lvl1_lvl2 = lvl1rdd.join(lvl2rdd)
val finalrdd = lvl1_lvl2.mapValues(value => function(value))

If I join the previous RDDs and then execute a function over the values of the resulting RDD (with mapValues), all the work is done in a single worker instead of being distributed over the different worker nodes of the cluster. The desired behaviour is to execute the function passed to mapValues in parallel, on as many nodes as the cluster allows.
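A tiny pure-Scala sketch (plain collections, not run on a cluster; the body of `function` here is a made-up stand-in) of why mapValues alone cannot break the key-to-partition assignment — it never touches the keys:

```scala
// Model of the RDD after the join: key -> (values from rdd1, values from rdd2).
val joined = Seq(("a", (Seq(1, 2), Seq(3))), ("b", (Seq(4), Seq(5, 6))))

// Stand-in for the unspecified `function`: counts the values on both sides.
def function(v: (Seq[Int], Seq[Int])): Int = v._1.size + v._2.size

// mapValues analogue on plain collections: only the values change.
val finalPairs = joined.map { case (k, v) => (k, function(v)) }

println(finalPairs)                                // List((a,3), (b,3))
println(finalPairs.map(_._1) == joined.map(_._1))  // true: keys are untouched
```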

thebluephantom
AngryCoder
  • Possible duplicate of [How to calculate the best numberOfPartitions for coalesce?](https://stackoverflow.com/q/40865326/6910411) – zero323 Aug 20 '18 at 16:29

1 Answer


1) Avoid groupByKey operations, as they act as a bottleneck for network I/O and execution performance. Prefer reduceByKey in this case: it shuffles less data than groupByKey, and the difference is much more visible on a larger dataset.

// reduceByKey expects a binary (V, V) => V function that combines two values
val lvl1rdd = rdd1.reduceByKey((a, b) => function(a, b))
val lvl2rdd = rdd2.reduceByKey((a, b) => function(a, b))
// perform the join operation on these resultant RDDs

Applying the function to each RDD separately and then joining them is far better than joining the RDDs first and applying a function after a groupByKey().

This also ensures that the tasks get distributed among different executors and execute in parallel.
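The difference can be modeled on plain Scala collections (illustrative only, no Spark involved): both strategies produce the same per-key result, but reduceByKey combines values locally before the shuffle, while groupByKey ships every value:

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// groupByKey-then-reduce analogue: first gathers *all* values per key,
// which in Spark means shuffling every single value across the network.
val viaGroup = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

// reduceByKey analogue: folds values into a running total per key, the way
// Spark combines values on the map side before shuffling partial results.
val viaReduce = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}

println(viaGroup == viaReduce)  // true: same answer, less data moved
```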

Refer to this link.

2) The underlying partitioning technique is the HashPartitioner. If we assume our data is initially located in n partitions, then a groupByKey operation will follow the hash mechanism:

partition = key.hashCode() % numPartitions
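That formula can be sketched in plain Scala (a simplified stand-in for what Spark's HashPartitioner does internally, not the actual Spark source); note the extra step needed to keep the result non-negative when hashCode is negative:

```scala
// Simplified model of HashPartitioner.getPartition: map a key's hashCode
// into the range [0, numPartitions).
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw  // hashCode may be negative
}

println(partitionFor(42, 4))  // 2  (42 % 4)
println(partitionFor(-7, 4))  // 1  (-7 % 4 = -3, shifted by 4)
```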

This creates a fixed number of partitions, which can be more than the initial number when you use the groupByKey operation. We can also customize the partitioning ourselves. For example:

// requires: import org.apache.spark.HashPartitioner
val result_rdd = rdd1.partitionBy(new HashPartitioner(2))

This creates 2 partitions, and in this way we can set the number of partitions. For deciding the optimal number of partitions, refer to this answer: https://stackoverflow.com/a/40866286/7449292

prasanna kumar