A very simple example would be rdd.foreach(print), which prints the value of each row in the RDD but does not modify the RDD in any way.
By contrast, map produces a new RDD. For example, this produces an RDD with the numbers 1 to 10:
>>> rdd = sc.parallelize(range(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computed a new value for each row and returned it, so I get a new RDD. However, using foreach here would be useless, because foreach returns nothing and doesn't modify the RDD in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, like print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values. You didn't want those values and you didn't want to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the output of print being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD, however, are just a bunch of None values.)
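If the lazy part is confusing, Python 3's built-in map behaves in a similar way and can be tried without Spark. This is only an analogy, not Spark itself: run_lazy_map_demo is a hypothetical helper name, and it captures stdout just so the timing of the print calls can be inspected.

```python
import io
from contextlib import redirect_stdout


def run_lazy_map_demo():
    """Return (output before consuming, output after consuming, results)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        lazy = map(print, range(3))  # built-in map is lazy: nothing printed yet
        before = buffer.getvalue()
        results = list(lazy)         # consuming the iterator finally runs print
        after = buffer.getvalue()
    return before, after, results


before, after, results = run_lazy_map_demo()
# before is empty, after holds the printed lines, results is all None
```

Here list(lazy) plays roughly the role that take plays in the Spark session: it forces the work to happen, and what comes back is only the None that print returned each time.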
More simply, call map if you care about the return value of the function. Call foreach if you don't.
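The same rule can be sketched in plain Python. The foreach function below is a hypothetical stand-in written for illustration, not PySpark's implementation; it only mirrors the fact that foreach runs a function for its side effects and returns None.

```python
def foreach(func, items):
    """Hypothetical stand-in for RDD.foreach: call func on each item, return None."""
    for item in items:
        func(item)


seen = []

# I care about the return values, so I use map and keep the result.
squares = list(map(lambda x: x * x, range(5)))

# I only care about the side effect (recording each item), so I use foreach.
outcome = foreach(seen.append, range(5))
```

After this runs, squares holds the transformed values, seen holds the recorded items, and outcome is None. One caveat when carrying this over to Spark: the function passed to foreach runs on the executors, so appending to a driver-side list like seen would not work on a cluster. There the side effect is usually something like writing to an external system or updating an accumulator.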