What's the difference between element and partition in Spark?

Question

I tried googling, but couldn't find an answer.

Taken from Apache Spark: map vs mapPartitions?

What's the difference between an RDD's map and mapPartitions

map works the function being utilized at a per element level while mapPartitions exercises the function at the partition level.

In this context, what is element level? Is it just an individual row?

Have you read the [examples I gave there](https://stackoverflow.com/a/39203798/647053)? If you run them, you will understand it in a practical way. – Ram Ghadiyaram, May 07 '20 at 01:50
Does this answer your question? [Apache Spark: map vs mapPartitions?](https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions) — user3190018, May 07 '20 at 01:52
@RamGhadiyaram yes, I read it and it created confusion for me — Biarys, May 07 '20 at 02:20

Ram Ghadiyaram · Accepted Answer · 2020-05-07T14:38:19.167

In layman's terms: imagine a shelf with 10 racks and 100 balls. You place 10 balls in each rack, so the 100 balls end up spread across the 10 racks. That is what `balldata.repartition(10)` does: it distributes the data uniformly, rather than putting all 100 balls in one or two racks.

Now, instead of applying your logic to each ball (element or row), you apply it to each rack (partition) once. That is the difference.

In this case, an element is a ball (a single row) and a partition is a rack.
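To make the analogy concrete, here is a minimal plain-Python sketch. It is not Spark code: the partitions are modelled as a list of lists, the names `balls`, `racks`, and `double_rack` are made up for illustration, and real Spark decides for itself which element lands in which partition.

```python
# Plain-Python model of the shelf analogy; no Spark involved.
num_racks = 10
rack_size = 10
balls = list(range(num_racks * rack_size))   # 100 elements (rows)

# Deal the balls evenly into racks: 10 partitions of 10 elements each.
racks = [balls[i * rack_size:(i + 1) * rack_size] for i in range(num_racks)]

# Element-level logic: the function sees one ball at a time (like map).
per_element = [ball * 2 for rack in racks for ball in rack]

# Partition-level logic: the function sees a whole rack at a time
# (like mapPartitions) and returns the processed balls of that rack.
def double_rack(rack):
    return [ball * 2 for ball in rack]

per_partition = [ball for rack in racks for ball in double_rack(rack)]
```

Both approaches produce the same values; what changes is how often your function is called: 100 times in the first case, 10 times in the second.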

The advantage is that if your processing logic needs heavy initialization, such as opening a database connection, you open one connection per partition (rack :-)) rather than one connection per element (ball :-)).
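A rough way to see the saving, again as a plain-Python sketch rather than real Spark code: `FakeConnection` is a hypothetical stand-in for an expensive resource, and the list of lists stands in for the partitions. In PySpark the two functions would be passed to `rdd.map(...)` and `rdd.mapPartitions(...)` respectively; the function given to `mapPartitions` receives an iterator over one partition's elements and returns an iterable of results.

```python
class FakeConnection:
    """Hypothetical stand-in for an expensive resource (e.g. a DB connection)."""
    opened = 0  # how many connections have been created so far

    def __init__(self):
        FakeConnection.opened += 1

    def process(self, ball):
        return ball + 1


def per_element(ball):
    # map-style: the setup cost is paid for every single element.
    conn = FakeConnection()
    return conn.process(ball)


def per_partition(rack):
    # mapPartitions-style: the setup cost is paid once per partition,
    # then the same connection is reused for every element in it.
    conn = FakeConnection()
    for ball in rack:
        yield conn.process(ball)


# 10 racks (partitions) holding 10 balls (elements) each.
racks = [list(range(i * 10, (i + 1) * 10)) for i in range(10)]

FakeConnection.opened = 0
out_map = [per_element(ball) for rack in racks for ball in rack]
opened_by_map = FakeConnection.opened             # one per ball

FakeConnection.opened = 0
out_map_partitions = [x for rack in racks for x in per_partition(rack)]
opened_by_map_partitions = FakeConnection.opened  # one per rack
```

With 100 balls in 10 racks, the element-level version opens 100 connections and the partition-level version opens 10, while both return the same results.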

I advise you to go through the examples given there to understand this better.

edited May 07 '20 at 14:38

answered May 07 '20 at 02:07



Thank you for your answer. I am confused because the map function I am trying to run returns a list of tuples. So if I do ```rdd = sc.parallelize(data)``` and then ```res = rdd.map(self._prepricing)```, my ```res.collect()``` is ```[(0,1,2),(3,4,5),(6,7,8)]```. Will an element be a tuple or each number? – Biarys May 07 '20 at 02:19
In this case, a tuple. – Ram Ghadiyaram May 07 '20 at 02:20
I see. Perhaps this is another question, but your conclusion says ```mapPartitions transformation is faster than map since it calls your function once/partition, not once/element.``` What is the point of regular map if you can apply the function to many elements at once? – Biarys May 07 '20 at 02:32
