Output from map() and flatMap() - what is the difference

Question

I'm very new to Spark (and to programming), so it would be great if you could help me understand the difference between these two outputs.

map()
>>> data = ['1', '2', '3', '4', '5', 'one', 'two']
>>> distData = sc.parallelize(data)
>>> maping = distData.map(lambda x: x.split())
>>> maping.collect()
[['1'], ['2'], ['3'], ['4'], ['5'], ['one'], ['two']]
>>> for i in maping.take(100): print(i)
... 
['1']
['2']
['3']
['4']
['5']
['one']
['two']

flatMap()

>>> maping = distData.flatMap(lambda x: x.split())
>>> maping.collect()
['1', '2', '3', '4', '5', 'one', 'two']
>>> for i in maping.take(100): print(i)
... 
1
2
3
4
5
one
two

The duplicate is in Scala, not PySpark. – thebluephantom Aug 14 '19 at 20:05

shriyog · Answer 1 · 2018-10-28T12:13:24.363

A map function is a one-to-one transformation, while a flatMap function is a one-to-zero-or-more transformation.

According to the docs,

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

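In other words, for each input value the function passed to flatMap returns a sequence of zero or more elements, and those sequences are flattened into a single output RDD. With map, each returned list is kept as one element, which is why the map() output in the question is a list of lists.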
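The difference can be imitated in plain Python without a Spark cluster. This is only a rough sketch of the semantics, not how Spark implements them: `local_map` and `local_flat_map` are hypothetical helper names invented for this illustration, and the sample data differs slightly from the question so that one input yields two outputs and another yields none.

```python
from itertools import chain

def local_map(func, items):
    """Apply func to each item; the output has exactly one element per input."""
    return [func(x) for x in items]

def local_flat_map(func, items):
    """Apply func to each item, then concatenate the returned sequences."""
    return list(chain.from_iterable(func(x) for x in items))

# 'one two' splits into two words; the empty string splits into none.
data = ['1', '2', 'one two', '']

mapped = local_map(lambda x: x.split(), data)
flat_mapped = local_flat_map(lambda x: x.split(), data)

print(mapped)       # [['1'], ['2'], ['one', 'two'], []]
print(flat_mapped)  # ['1', '2', 'one', 'two']
```

The mapped result always has as many elements as the input (four here), while the flattened result can be shorter or longer, depending on how many items each call returns.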

Refer to this SO question which demonstrates a good use case.

edited Oct 28 '18 at 12:13

answered Oct 28 '18 at 11:53


Isn't a map function one-to-one, and a flatMap (possibly) one-to-many? – user0000 Oct 28 '18 at 12:07
Yes, it is. Specifically, for each input, `flatMap` outputs a sequence that can have 0 or more elements. – shriyog Oct 28 '18 at 12:10
