3

I know the difference between map and mapPartitions which target elements and iterators of elements respectively.

When should I use which? If the overhead is similar, why would I ever use mapPartitions, since map is easier to write?

Daniel Darabos
  • 26,991
  • 10
  • 102
  • 114
jxieeducation
  • 27
  • 1
  • 7

1 Answers1

2

RDD.map maps a function to each element of an RDD, whereas RDD.mapPartitions maps a function to each partition of an RDD.

map will not change the number of elements in an RDD, while mapPartitions might very well do so.

See also this answer and comments on a similar question.

Community
  • 1
  • 1
karlson
  • 5,325
  • 3
  • 30
  • 62
  • Nice answer. Why it can change the number? – gsamaras Sep 08 '16 at 16:28
  • 1
    @gsamaras Because you might e.g. choose to create a new object for each partition which would then make up your new RDD. Your new RDD would then have as many elements as the former had partitions. – karlson Sep 08 '16 at 16:55