
So I saw this question on Stack Overflow asked by another user, and I tried to write the code myself as I am trying to practice Scala and Spark:

The question was to find the per-key average from a list:

Assuming the list is: ( (1,1), (1,3), (2,4), (2,3), (3,1) )
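For reference, a minimal sketch of how that list could be turned into the `input` pair RDD used below (this setup is not shown in the original question; it assumes an existing SparkContext `sc`):

// Hypothetical setup: build the sample pair RDD from the list above
val input = sc.parallelize(List((1, 1), (1, 3), (2, 4), (2, 3), (3, 1)))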

The code was:

val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))

So basically the combineByKey call above creates an RDD of type (Int, (Int, Int)), where the first Int is the key and the value is an (Int, Int) pair: its first Int is the sum of all the values with the same key, and its second Int is the number of times the key appeared.
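To make that concrete, with the sample list the combineByKey step should produce (key, (sum, count)) pairs along these lines (ordering across partitions is not guaranteed):

// Result of combineByKey: RDD[(Int, (Int, Int))] holding (key, (sum, count))
//   (1, (4, 2))   // key 1 appeared twice, values 1 + 3 = 4
//   (2, (7, 2))   // key 2 appeared twice, values 4 + 3 = 7
//   (3, (1, 1))   // key 3 appeared once, value 1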

I understand what is going on, but for some reason, when I rewrite the code like this:

val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).mapValues(value: (Int, Int) => (value._1 / value._2))
result.collectAsMap().map(println(_))

When I use mapValues instead of map with the case keyword, the code doesn't work. It gives an error saying `error: not found: type /`. What is the difference between using map with case and using mapValues? I thought mapValues would just take the value (which in this case is an (Int, Int)) and return a new value, with the key of the key-value pair staying the same.

2 Answers


Never mind, I found a good article that addresses my problem: http://danielwestheide.com/blog/2012/12/12/the-neophytes-guide-to-scala-part-4-pattern-matching-anonymous-functions.html

If anyone else has the same problem, that article explains it well!
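In short, `{ case (key, value) => ... }` is a pattern-matching anonymous function, while `value: (Int, Int) => ...` inside plain parentheses is parsed as a type ascription, so the compiler tries to read `value._1 / value._2` as a type and ends up looking for a type named `/`. A rough sketch of variants that do compile, assuming `sums` is the RDD[(Int, (Int, Int))] produced by combineByKey (the name `sums` is hypothetical):

// 1. An explicitly typed lambda parameter needs its own set of parentheses
val avg1 = sums.mapValues((value: (Int, Int)) => value._1 / value._2.toFloat)

// 2. A pattern-matching anonymous function works with mapValues too
val avg2 = sums.mapValues { case (sum, count) => sum / count.toFloat }

// 3. Or simply let the compiler infer the parameter type
val avg3 = sums.mapValues(value => value._1 / value._2.toFloat)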


Try:

val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).mapValues(value => value._1 / value._2.toFloat)  // .toFloat keeps the average from being truncated by integer division
result.collectAsMap().map(println(_))
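With the sample list from the question, and with `.toFloat` kept in the division so the averages aren't truncated, collecting and printing should output something like this (key order may vary):

(1,2.0)
(2,3.5)
(3,1.0)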
banjara
  • Oh this worked as well! Why is it that you do not need the `value: (Int, Int)`? I also did this and it worked: `result.map(elem => (elem._1, (elem._2._1 / elem._2._2))).collectAsMap().map(println(_))` –  May 11 '16 at 05:58
  • @1290 I don't know the exact answer, but you don't need to explicitly specify the data type of the RDD in Spark transformations/actions. – banjara May 11 '16 at 09:07
  • Oh ok I see. Thanks! –  May 12 '16 at 22:38