
I'm referring to this popular blog post:

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Here, the author first demonstrates a simple regular Python mapper

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

But then he mentions:

In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF).

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()

My question: Why is the second mapper more efficient than the first mapper?

If I understand yield correctly, it suspends the function until the next item is requested. So basically, whenever data is iterated over, read_input resumes just long enough to yield another item.
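
For example, this is how I picture it (io.StringIO here is just a stand-in for a real input stream, purely for illustration):

import io

def read_input(file):
    for line in file:
        yield line.split()

# stand-in for sys.stdin; nothing is read until we ask for an item
data = read_input(io.StringIO(u"foo bar\nbaz\n"))

print(next(data))  # ['foo', 'bar'] -- only the first line has been consumed so far
print(next(data))  # ['baz']        -- the generator resumed and read one more line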

But even in the first simple mapper, aren't we doing the same thing? for line in sys.stdin: would basically read whatever stdin is available to the host where the mapper is running, and we operate on it line by line.
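
For instance, if I feed lines to the simple mapper's loop myself (again using io.StringIO as a stand-in for sys.stdin), it also appears to consume one line at a time:

import io

# a file object is itself a lazy iterator: pulling from it reads one line at a time,
# so the simple mapper never needs to hold all of stdin in memory either
fake_stdin = io.StringIO(u"foo bar\nbaz qux\n")  # stand-in for sys.stdin

first = next(fake_stdin)   # only one line has been read at this point
for word in first.strip().split():
    print('%s\t%d' % (word, 1))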

What is the benefit of using yield and what sort of gains can I expect over the first one? Speed? Memory?

Thanks a lot.

EDIT: I'm not sure why people think this is a duplicate. I'm not asking how the yield keyword works; I'm asking for an explanation of what benefit it provides in the Hadoop streaming mapper context.

user1265125
  • The example is quite trivial, but yield really comes into its own when you are consuming a large object. I have marked this as a duplicate because the question really isn't about `hadoop`, but about the yield keyword. – Burhan Khalid Sep 01 '14 at 10:08
  • It's also relevant to Hadoop because it's directly in the context of Hadoop mappers and how Hadoop Streaming uses the mapper script. Anyway, can you please elaborate on the 'large object' you're talking about in this context? Is it the stdin to one host? Is it the entire stdin? Is it the 'line' obtained from one iteration on stdin? – user1265125 Sep 01 '14 at 10:28
