
I am new to Spark (using version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.

I want my mapped RDD to be grouped by key. Reading online, it is suggested that we should avoid groupByKey and use reduceByKey instead. But when I apply reduceByKey, I do not get the list of values per key that my code expects. Example:

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))

My "values" for actual task are huge, having 300 plus columns in key-values pair And when I will do group by on common key it will result in shuffle which i want to avoid.

I want something like this as output (key, List or Array of values) from my mapped RDD:

rdd.groupByKey()

which gives me the following output:

(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))

But when I use

rdd.reduceByKey((x,y) => x+y)

I get the values concatenated together, as shown below. If a pipe ('|') or some other separator character were there, e.g. (k2,v21|v22), my problem would be partly solved, but having a list would still be better coding practice:

(k3,v31)
(k2,v21v22)
(k1,v11v21)

Please help

Yogesh
  • You can follow this link to see an exact implementation: [https://stackoverflow.com/questions/27002161/reduce-a-key-value-pair-into-a-key-list-pair-with-apache-spark](https://stackoverflow.com/questions/27002161/reduce-a-key-value-pair-into-a-key-list-pair-with-apache-spark) – Atul Singh Dec 01 '18 at 09:53

1 Answer


If you refer to the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html):

For groupByKey it says: "When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs." The word Iterable is very important here: when you get the values as (v21, v22), they are iterable.

It further says: "Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance."

So, as I understand it: if you want the returned RDD to have iterable values per key, use groupByKey; if you want a single aggregated value per key, such as a sum, use reduceByKey.
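For completeness, here is a minimal sketch (my own, not from the docs) of how aggregateByKey could build a list of values per key while still combining map-side before the shuffle, assuming the sample data from the question:

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))

// zero value: an empty list per key
// seqOp: fold one value into a partition-local list
// combOp: merge the lists built on different partitions
val grouped = rdd.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,
  (l1, l2) => l1 ::: l2
)

grouped.collect().foreach(println)
// e.g. (k2,List(v22, v21)) -- order within each list is not guaranteed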

Now, if your tuples were not (String, String), i.e. (K1, V1), but instead (String, ListBuffer[String]), i.e. (K1, ListBuffer("V1")), then you could do rdd.reduceByKey((x, y) => x ++= y).
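A runnable sketch of that idea (my phrasing of it, using the question's sample data; note that ++= is needed to append another buffer, since += appends only a single element):

import scala.collection.mutable.ListBuffer

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))

// wrap each value in a one-element buffer, then merge the buffers per key
val listed = rdd.mapValues(v => ListBuffer(v))
val reduced = listed.reduceByKey((x, y) => x ++= y)

reduced.collect().foreach(println)
// e.g. (k2,ListBuffer(v21, v22))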

  • You mean having a mapped RDD of lists instead of strings, like => http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-groupByKey-td2238.html => val rdd = sc.parallelize(List(("k1", List("v11")), ("k1", List("v21")), ("k2", List("v21")), ("k2", List("v22")), ("k3", List("v31")))); val reduceRDD = rdd.reduceByKey(_ ++ _) – Yogesh Jan 19 '16 at 09:09
  • The solution mentioned in the comment above actually worked for me, but I am not sure if it's the right way to do it. – Yogesh Jan 20 '16 at 14:21