11

I am learning Spark in Python and wondering can anyone explain the difference between the action foreach() and transformation map()?

rdd.map() returns a new RDD, like the original map function in Python. However, I want to see a rdd.foreach() function and understand the differences. Thanks!

desertnaut
  • 57,590
  • 26
  • 140
  • 166
Chenxi Zeng
  • 367
  • 1
  • 4
  • 11

3 Answers3

19

A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.

For example, this produces an RDD with the numbers 1 - 10:

>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

The map call computed a new value for each row and it returned it so that I get a new RDD. However, if I used foreach that would be useless because foreach doesn't modify the rdd in any way:

>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>

Conversely, calling map on a function that returns None like print isn't very useful:

>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]

The print call returns None so mapping that just gives you a bunch of None values and you didn't want those values and you didn't want to save them so returning them is a waste. (Note the lines with 1, 2, etc. are the print being executed and they don't show up until you call take since the RDD is executed lazily. However the contents of the RDD are just a bunch of None.

More simply, call map if you care about the return value of the function. Call foreach if you don't.

Oliver Dain
  • 9,617
  • 3
  • 35
  • 48
  • rdd.foreach(print) returns a Syntax Error. I guess my question is what is the differences? the transformation map() and action foreach(), they seem to be identical to me. – Chenxi Zeng Dec 29 '16 at 23:58
  • @ChenxiZeng updates with a hopefully more clear answer. – Oliver Dain Dec 30 '16 at 00:14
  • thanks, however, ...map(print) still returns a Syntax Error (Python2.7). My understanding is that .foreach() is useful to do some action like print or print to a file, .map is to create another dataset (RDD). Is this right? – Chenxi Zeng Dec 30 '16 at 00:34
  • 1
    Yes, that's correct. I think you're getting the syntax error because you're using Python 2.x, where print isn't a regular function, and I'm using Python 3 where it is. `foreach(lambda x: print x)` should work for you and be equivalent. – Oliver Dain Dec 30 '16 at 00:38
  • I think you mean `map(lambda x: x)`, right? The `print` function still returns a syntax error... – Chenxi Zeng Dec 30 '16 at 00:48
  • 1
    @ChenxiZeng Ah - here's why the lambda with the print doesn't work: http://stackoverflow.com/questions/2970858/why-doesnt-print-work-in-a-lambda. The Python2.7 equivalent would be to define a function that just prints it's argument and then pass that to foreach like `def printit(x): print x` and then `rdd.foreach(printit)`. Note that `rdd.map(lambda x: x)` takes and RDD and creates a new RDD with exactly the same contents (it maps the value `x` to the same value). Anyway - if I've answered the question please mark this answer as correct. – Oliver Dain Dec 30 '16 at 01:27
5

Map is a transformation, thus when you perform a map you apply a function to each element in the RDD and return a new RDD where additional transformations or actions can be called.

Foreach is an action, it takes each element and applies a function, but it does not return a value. This is particularly useful in you have to call perform some calculation on an RDD and log the result somewhere else, for example a database or call a REST API with each element in the RDD.

For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.

queries = <code to load queries or a transformation that was applied on other RDDs>

Then you want to save those queries in another system via a call to another API

import urllib2

def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)

queries.foreach(log_search)

Now you have executed the log_query on each element of the RDD. If you have done a map, nothing would have happened yet, until you called an action.

Safwan
  • 3,300
  • 1
  • 28
  • 33
xmorera
  • 1,933
  • 3
  • 20
  • 35
0

TL;DR

  1. foreach() is an Action while map() is a Transformation
  2. foreach() is used for side-effect operations while map() i used for non-side-effect operations
  3. foreach() returns None while map() returns RDD

Long version

To understand foreach we first need to understand the side-effect operations. These operations are the processes that changes the state of the system such as

  1. writing the data to external file or database
  2. update a variable

foreach is used in operations which are side-effect operations. They have NoneType as return.

For example:

acc = sc.accumulator(0)

def add_to_accumulator(x): 
    global acc
    acc += x

sc.parallelize(range(5)).foreach(add_to_accumulator)

print(acc)

>> 10

On the other hand map is used for element wise mapping and doesn't have NoneType as return.

sc.parallelize(range(5)).map(lambda x: x**2)

>> [0, 1, 4, 9 , 16, 25]

Lawhatre
  • 1,302
  • 2
  • 10
  • 28