
I am new to Spark (using version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.

I want my mapped RDD to be grouped by key. Reading online, it is suggested that we should avoid groupByKey and use reduceByKey instead. But when I apply reduceByKey, I do not get the list of values per key that my code expects. Example:

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))

My "values" for actual task are huge, having 300 plus columns in key-values pair And when I will do group by on common key it will result in shuffle which i want to avoid.

I want something like this as output (key, List or Array of values) from my mapped RDD:

rdd.groupByKey()

which gives me the following output:

(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))

But when I use

rdd.reduceByKey((x,y) => x+y)

I get the values concatenated together, as shown below. If a pipe ('|') or some other separator character were there, e.g. (k2,v21|v22), my problem would be partly solved, but having a list would still be better coding practice:

(k3,v31)
(k2,v21v22)
(k1,v11v21)

Please help

Yogesh
  • You can follow this link to see an exact implementation: [https://stackoverflow.com/questions/27002161/reduce-a-key-value-pair-into-a-key-list-pair-with-apache-spark](https://stackoverflow.com/questions/27002161/reduce-a-key-value-pair-into-a-key-list-pair-with-apache-spark) – Atul Singh Dec 01 '18 at 09:53

1 Answer


If you refer to the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html):

For groupByKey it says: "When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs." The word Iterable is very important here: when you get the values as (v21, v22), they are iterable.

It further says: "Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance."

So, as I understand it: if you want the returned RDD to have iterable values per key, use groupByKey; if you want a single aggregated value per key, such as a sum, use reduceByKey.
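For completeness, here is a minimal sketch (my own, not from the docs) of how aggregateByKey could build a list of values per key while still combining map-side before the shuffle, assuming the sample data from the question:

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))

// zero value: an empty list per key
// seqOp: fold one value into a partition-local list
// combOp: merge the lists built on different partitions
val grouped = rdd.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,
  (l1, l2) => l1 ::: l2
)

grouped.collect().foreach(println)
// e.g. (k2,List(v22, v21)) -- order within each list is not guaranteed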

Now, if your tuples were not (String, String), i.e. (K1, V1), but instead (String, ListBuffer[String]), i.e. (K1, ListBuffer("V1")), then you could do rdd.reduceByKey((x, y) => x ++= y).
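A runnable sketch of that idea (my phrasing of it, using the question's sample data; note that ++= is needed to append another buffer, since += appends only a single element):

import scala.collection.mutable.ListBuffer

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))

// wrap each value in a one-element buffer, then merge the buffers per key
val listed = rdd.mapValues(v => ListBuffer(v))
val reduced = listed.reduceByKey((x, y) => x ++= y)

reduced.collect().foreach(println)
// e.g. (k2,ListBuffer(v21, v22))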

  • You mean having a mapped RDD of lists instead of strings, like => http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-groupByKey-td2238.html => val rdd = sc.parallelize(List(("k1", List("v11")), ("k1", List("v21")), ("k2", List("v21")), ("k2", List("v22")), ("k3", List("v31")))); val reduceRDD = rdd.reduceByKey(_ ++ _) – Yogesh Jan 19 '16 at 09:09
  • The solution mentioned in the comment above actually worked for me, but I am not sure if it's the right way to do it. – Yogesh Jan 20 '16 at 14:21