
I'm currently learning Spark and developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() and what are cases where I clearly have to use one instead of the other?

jtitusj

4 Answers


mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see comment at the bottom):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
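To make the equivalence concrete, here is a small runnable sketch using plain Scala collections as a stand-in for an RDD (the names `pairs` and `f` are illustrative, not from Spark; Scala's `Map` has `map` and `mapValues` methods that behave analogously to the pair-RDD operations):

```scala
// Stand-in for RDD[(String, Int)]: a plain Scala Map of key-value pairs
val pairs = Map("a" -> 1, "b" -> 2)
val f: Int => Int = _ * 10

// map sees the whole (key, value) tuple, so we rebuild the pair ourselves
val viaMap = pairs.map { case (k, v) => (k, f(v)) }

// mapValues applies f to the value only, leaving the keys untouched
val viaMapValues = pairs.mapValues(f).toMap
```

Both produce the same result; `mapValues` just says it more directly.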

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result reverts to the default partitioning), since the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Tzach Zohar
I wonder if they play a role in performance, since I am trying to optimize [this](http://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans), but from what you said, I guess it won't make any difference... – gsamaras Sep 23 '16 at 04:33

@gsamaras it can have an impact on performance, as losing the partitioning information will force a shuffle down the road if you need to repartition again with the same key. – Madhava Carrillo Oct 04 '17 at 14:41

When we use map() with a pair RDD, we get access to both the key and the value. At times we are only interested in the value (and not the key). In those cases, we can use mapValues() instead of map().

Example of mapValues:

val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))

// pair each mark with a count of 1, keeping the subject as the key
val mapped = inputrdd.mapValues(mark => (mark, 1))

// sum the marks and the counts per subject
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

reduced.collect

Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))

val average = reduced.map { x =>
  val (total, count) = x._2
  (x._1, total / count)
}

average.collect()

res1: Array[(String, Int)] = Array((english,65), (maths,55))
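The same per-key average can be reproduced with plain Scala collections (no SparkContext needed), which is handy for checking the logic; here `groupBy` plus a per-group `reduce` plays the role of `reduceByKey`:

```scala
val marks = Seq(("maths", 50), ("maths", 60), ("english", 65))

// analogue of mapValues: pair each mark with a count of 1
val mapped = marks.map { case (subject, mark) => (subject, (mark, 1)) }

// analogue of reduceByKey: group by subject, then sum marks and counts
val reduced = mapped.groupBy(_._1).map { case (subject, rows) =>
  (subject, rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2)))
}

// integer division, matching the Spark example's output
val average = reduced.map { case (subject, (total, count)) => (subject, total / count) }
```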

vaquar khan

map takes a function that transforms each element of a collection:

 map(f: T => U)
RDD[T] => RDD[U]

When T is a tuple, we may want to act only on the values, not the keys. mapValues takes a function that maps the values in the input to the values in the output:

 mapValues(f: V => W)
RDD[(K, V)] => RDD[(K, W)]

Tip: use mapValues when the data is partitioned by key and you want to avoid a reshuffle.

Ram Ghadiyaram
val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))

(1)

map():-

val mapresult = inputrdd.map { b => (b, 1) }
mapresult.collect

Result: Array(((india,250),1), ((england,260),1), ((england,180),1))

(2)

mapValues():-

val mapValuesResult = inputrdd.mapValues(b => (b, 1))
mapValuesResult.collect

Result:

Array((india,(250,1)), (england,(260,1)), (england,(180,1)))
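The contrast can be checked with plain Scala collections (no SparkContext), using this answer's data; `Seq` stands in for the RDD, and the pattern-match form mimics what mapValues does:

```scala
val inputs = Seq(("india", 250), ("england", 260), ("england", 180))

// map receives the whole tuple; wrapping it makes the original pair the new key
val viaMap = inputs.map { b => (b, 1) }

// the mapValues behaviour: transform only the value, keep the key as-is
val viaMapValues = inputs.map { case (k, v) => (k, (v, 1)) }
```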
Robert Columbia
Unless you are absolutely sure that you can fit all the data into memory, don't ever try to use `.collect`. – jtitusj Jun 05 '18 at 10:30