
I'm very new to Spark (and to programming), so it would be great if you could help me understand the difference between these two outputs.

map()
>>> data = ['1', '2', '3', '4', '5', 'one', 'two']
>>> distData = sc.parallelize(data)
>>> maping = distData.map(lambda x: x.split())
>>> maping.collect()
[['1'], ['2'], ['3'], ['4'], ['5'], ['one'], ['two']]
>>> for i in maping.take(100): print(i)
... 
['1']
['2']
['3']
['4']
['5']
['one']
['two']

flatMap()

>>> maping = distData.flatMap(lambda x: x.split())
>>> maping.collect()
['1', '2', '3', '4', '5', 'one', 'two']
>>> for i in maping.take(100): print(i)
... 
1
2
3
4
5
one
two
user0000

1 Answer


A map function is a one-to-one transformation, while a flatMap function is a one-to-zero-or-more transformation.

According to the docs,

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

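In other words, for each input value the function passed to flatMap returns a sequence of zero or more elements, and those sequences are flattened (concatenated) to form the output RDD. In your example, x.split() returns a list for every string, so map keeps each list as a single element, while flatMap unpacks the lists into individual strings.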
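The same behaviour can be sketched on a plain Python list, with no Spark involved. This is only a rough model of the semantics, not the Spark API: local_map and local_flat_map are hypothetical helper names, and the sample data is made up to include a string that splits into several items and one that splits into none.

```python
from itertools import chain


def local_map(func, items):
    """Rough model of RDD.map: exactly one output element per input element."""
    return [func(x) for x in items]


def local_flat_map(func, items):
    """Rough model of RDD.flatMap: func returns an iterable for each input,
    and all of those iterables are concatenated into one flat list."""
    return list(chain.from_iterable(func(x) for x in items))


# Hypothetical sample: one token, two tokens, and an empty string (zero tokens).
data = ['1', 'one two', '']

mapped = local_map(str.split, data)
flattened = local_flat_map(str.split, data)

print(mapped)     # [['1'], ['one', 'two'], []]
print(flattened)  # ['1', 'one', 'two']
```

The mapped result always has as many elements as the input (three lists here), while the flattened result can be shorter or longer: the empty string contributes nothing and 'one two' contributes two items.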

Refer to this SO question which demonstrates a good use case.

shriyog