I tried googling, but couldn't find an answer.

Taken from [Apache Spark: map vs mapPartitions?](https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions):

> What's the difference between an RDD's map and mapPartitions?
>
> map works the function being utilized at a per element level while mapPartitions exercises the function at the partition level.

In this context, what is element level? Is it just an individual row?

Biarys
  • Have you read the [examples I gave there](https://stackoverflow.com/a/39203798/647053)? If you execute them, you will understand the difference in a practical way. – Ram Ghadiyaram May 07 '20 at 01:50
  • Does this answer your question? [Apache Spark: map vs mapPartitions?](https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions) – user3190018 May 07 '20 at 01:52
  • @RamGhadiyaram yes, I read it and it created confusion for me – Biarys May 07 '20 at 02:20

1 Answer

In layman's terms: you have a shelf with 10 racks and 100 balls, as shown in the picture. You put 10 balls in each rack, so the 100 balls are spread across the 10 racks. That is balldata.repartition(10): the data is distributed uniformly rather than all 100 balls sitting in one or two racks.

Now, instead of applying your logic to each ball (an element, or row), you apply it to each rack (a partition) once. With mapPartitions, your function is called once per rack and receives all the balls in that rack. That is the difference.

In this case, an element is a ball (a single row) and a partition is a rack.

The advantage: if your processing logic needs heavy initialization, such as opening a database connection, you open one connection per partition (rack :-)) rather than one connection per element (ball :-)).
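To make the ball/rack picture concrete, here is a minimal sketch in plain Python. It is only a local model of the idea, not Spark itself: `split_into_partitions`, `local_map`, and `local_map_partitions` are hypothetical helpers standing in for `repartition`, `rdd.map`, and `rdd.mapPartitions`. The counters show how many times the user function gets called in each case.

```python
from typing import Callable, Iterable, Iterator, List

def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into contiguous chunks (a rough stand-in for repartition)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def local_map(partitions: List[List[int]], f: Callable[[int], int]) -> List[List[int]]:
    """Model of map: f is called once for every element."""
    return [[f(x) for x in part] for part in partitions]

def local_map_partitions(
    partitions: List[List[int]],
    f: Callable[[Iterator[int]], Iterable[int]],
) -> List[List[int]]:
    """Model of mapPartitions: f is called once per partition with an iterator."""
    return [list(f(iter(part))) for part in partitions]

calls = {"per_element": 0, "per_partition": 0}

def double_one(ball: int) -> int:
    calls["per_element"] += 1
    return ball * 2

def double_rack(rack: Iterator[int]) -> Iterator[int]:
    calls["per_partition"] += 1  # runs once when the rack starts being processed
    for ball in rack:
        yield ball * 2

balls = list(range(100))                   # 100 balls
racks = split_into_partitions(balls, 10)   # 10 racks of 10 balls each

mapped = local_map(racks, double_one)
mapped_by_rack = local_map_partitions(racks, double_rack)

print(calls)                     # function invocations in each style
print(mapped == mapped_by_rack)  # both styles produce the same data
```

Both versions compute the same result; what changes is the granularity at which your function is invoked. In PySpark the corresponding calls would be `rdd.map(double_one)` and `rdd.mapPartitions(double_rack)`.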
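The connection example can be sketched the same way. This is an illustration under assumptions, again in plain Python rather than Spark: `FakeConnection` is a made-up class that only counts how often it is opened, and `enrich_element` / `enrich_partition` are hypothetical functions of the kind you would pass to `map` and `mapPartitions` respectively.

```python
from typing import Iterator, List

class FakeConnection:
    """Hypothetical stand-in for an expensive resource such as a DB connection."""
    opened = 0

    def __init__(self) -> None:
        FakeConnection.opened += 1  # count every (expensive) open

    def lookup(self, key: int) -> int:
        return key * 10

    def close(self) -> None:
        pass

def enrich_element(key: int) -> int:
    """map style: a connection is opened for every single element."""
    conn = FakeConnection()
    try:
        return conn.lookup(key)
    finally:
        conn.close()

def enrich_partition(keys: Iterator[int]) -> Iterator[int]:
    """mapPartitions style: one connection is shared by the whole partition."""
    conn = FakeConnection()
    try:
        for key in keys:
            yield conn.lookup(key)
    finally:
        conn.close()

# 10 partitions of 10 elements each, built by hand for the sketch.
partitions: List[List[int]] = [list(range(i, i + 10)) for i in range(0, 100, 10)]

FakeConnection.opened = 0
per_element = [[enrich_element(k) for k in part] for part in partitions]
opens_with_map = FakeConnection.opened

FakeConnection.opened = 0
per_partition = [list(enrich_partition(iter(part))) for part in partitions]
opens_with_map_partitions = FakeConnection.opened

print(opens_with_map, opens_with_map_partitions)
```

With 100 elements in 10 partitions, the per-element version opens the connection 100 times and the per-partition version opens it 10 times, while the output is identical.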

I advise you to go through the examples given there to understand better

[Image: balls distributed across the racks of a shelf]

courtesy/credits for image here

Ram Ghadiyaram
  • Thank you for your answer. I am confused because the map function I am trying to apply returns a list of tuples. So if I do ```rdd = sc.parallelize(data)```, then ```res = rdd.map(self._prepricing)```, and my ```res.collect()``` is ```[(0,1,2),(3,4,5),(6,7,8)]```, will an element be a tuple or each number? – Biarys May 07 '20 at 02:19
  • in this case a tuple – Ram Ghadiyaram May 07 '20 at 02:20
  • I see. Perhaps this is another question, but in your conclusion you say ```mapPartitions transformation is faster than map since it calls your function once/partition, not once/element.``` What is the point of regular map if you can apply a function to many elements at once? – Biarys May 07 '20 at 02:32
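
On the tuple question from the comments above, a small sketch may help. It uses Python's built-in `map` on a plain list as a stand-in for an RDD holding the same data (an assumption made so it runs without Spark); `describe` is a hypothetical helper. Each item of the collection is one element, so the mapped function receives a whole tuple, not the individual numbers.

```python
from itertools import chain

# The same shape of data as in the comment: a collection of tuples.
data = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

def describe(element):
    """Report the type of whatever the mapped function receives."""
    return type(element).__name__

per_element_types = list(map(describe, data))  # one call per tuple
row_sums = list(map(sum, data))                # each call sees a full tuple

# To work on individual numbers instead, the tuples must be flattened first
# (in Spark, flatMap plays this role).
flattened = list(chain.from_iterable(data))

print(per_element_types)
print(row_sums)
print(flattened)
```

So with this data there are three elements, each of them a tuple; the nine numbers only become elements after flattening.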