
I'm currently learning Spark and developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() and what are cases where I clearly have to use one instead of the other?

jtitusj

4 Answers


mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see comment at the bottom):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
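To make the equivalence concrete, here is a small runnable sketch using plain Scala collections as a stand-in for an RDD (the names `pairs` and `f` are illustrative, not from Spark; Scala's `Map` has `map` and `mapValues` methods that behave analogously to the pair-RDD operations):

```scala
// Stand-in for RDD[(String, Int)]: a plain Scala Map of key-value pairs
val pairs = Map("a" -> 1, "b" -> 2)
val f: Int => Int = _ * 10

// map sees the whole (key, value) tuple, so we rebuild the pair ourselves
val viaMap = pairs.map { case (k, v) => (k, f(v)) }

// mapValues applies f to the value only, leaving the keys untouched
val viaMapValues = pairs.mapValues(f).toMap
```

Both produce the same result; `mapValues` just says it more directly.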

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result reverts to the default partitioning), since the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Tzach Zohar
I wonder if they play a role in performance, since I am trying to optimize [this](http://stackoverflow.com/questions/39235576/unbalanced-factor-of-kmeans), but from what you said, I guess it won't make any difference... – gsamaras Sep 23 '16 at 04:33

@gsamaras it can have an impact on performance, as losing the partitioning information will force a shuffle down the road if you need to repartition again with the same key. – Madhava Carrillo Oct 04 '17 at 14:41

When we use map() with a pair RDD, we get access to both the key and the value. At times we are only interested in the value (and not the key). In those cases, we can use mapValues() instead of map().

Example of mapValues:

val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))

// pair each mark with a count of 1, keeping the subject as the key
val mapped = inputrdd.mapValues(mark => (mark, 1))

// sum the marks and the counts per subject
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

reduced.collect

Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))

val average = reduced.map { x =>
  val (total, count) = x._2
  (x._1, total / count)
}

average.collect()

res1: Array[(String, Int)] = Array((english,65), (maths,55))
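The same per-key average can be reproduced with plain Scala collections (no SparkContext needed), which is handy for checking the logic; here `groupBy` plus a per-group `reduce` plays the role of `reduceByKey`:

```scala
val marks = Seq(("maths", 50), ("maths", 60), ("english", 65))

// analogue of mapValues: pair each mark with a count of 1
val mapped = marks.map { case (subject, mark) => (subject, (mark, 1)) }

// analogue of reduceByKey: group by subject, then sum marks and counts
val reduced = mapped.groupBy(_._1).map { case (subject, rows) =>
  (subject, rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2)))
}

// integer division, matching the Spark example's output
val average = reduced.map { case (subject, (total, count)) => (subject, total / count) }
```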

vaquar khan

map takes a function that transforms each element of a collection:

 map(f: T => U)
RDD[T] => RDD[U]

When T is a tuple, we may want to act only on the values, not the keys. mapValues takes a function that maps the values in the input to the values in the output:

 mapValues(f: V => W)
RDD[(K, V)] => RDD[(K, W)]

Tip: use mapValues when the data is partitioned by key and you want to avoid a reshuffle.

Ram Ghadiyaram
val inputrdd = sc.parallelize(Seq(("india", 250), ("england", 260), ("england", 180)))

(1)

map():-

val mapresult = inputrdd.map { b => (b, 1) }
mapresult.collect

Result: Array(((india,250),1), ((england,260),1), ((england,180),1))

(2)

mapValues():-

val mapValuesResult = inputrdd.mapValues(b => (b, 1))
mapValuesResult.collect

Result:

Array((india,(250,1)), (england,(260,1)), (england,(180,1)))
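The contrast can be checked with plain Scala collections (no SparkContext), using this answer's data; `Seq` stands in for the RDD, and the pattern-match form mimics what mapValues does:

```scala
val inputs = Seq(("india", 250), ("england", 260), ("england", 180))

// map receives the whole tuple; wrapping it makes the original pair the new key
val viaMap = inputs.map { b => (b, 1) }

// the mapValues behaviour: transform only the value, keep the key as-is
val viaMapValues = inputs.map { case (k, v) => (k, (v, 1)) }
```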
Robert Columbia
Unless you are absolutely sure that you can fit all the data into memory, don't ever try to use `.collect`. – jtitusj Jun 05 '18 at 10:30