
I am trying to understand how to write a Hadoop MapReduce program in Python, following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

This is mapper.py:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
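
As a quick sanity check, the tutorial runs the mapper on its own by piping text into it on the command line, something like this (the sample words are arbitrary):

echo "foo foo quux labs foo bar quux" | python mapper.py

which emits one word<TAB>1 pair per input word.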

I don't understand the use of yield here. read_input generates one line at a time. However, main only calls read_input once, which would seem to correspond to reading only the first line of the file. How do the remaining lines get read as well?

usual me

1 Answer


Actually, main calls read_input only once, and that single call doesn't execute the function body at all. Because read_input contains a yield, calling it just creates and returns a generator object.

data = read_input(sys.stdin)
# read_input is called exactly once here. Its body has not run yet;
# data is now a generator object.
for words in data:

On every iteration of the for loop, Python implicitly calls next() on data, the generator returned by read_input. That resumes the body of read_input from wherever it last paused, runs it until the next yield, and assigns the yielded value (the list of words from one line) to words.

Basically, for words in data is shorthand for "call next(data) and assign the result to words, then execute the loop body; repeat until the generator is exhausted".
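
If it helps, here is a minimal, self-contained sketch of that mechanism (read_lines and its input list are made up for illustration; read_lines just mimics read_input):

#!/usr/bin/env python
"""Illustrative sketch: a for loop drives a generator lazily."""

def read_lines(lines):
    # Like read_input: this body does not run when read_lines() is called.
    for line in lines:
        print("generator resumed")
        yield line.split()

data = read_lines(["hello world", "foo bar"])
print("generator created")  # printed before anything inside read_lines

# The for loop below behaves roughly like:
#     while True:
#         try:
#             words = next(data)  # resume read_lines until its next yield
#         except StopIteration:
#             break
#         ...loop body...
for words in data:
    print(words)

Running it prints "generator created" first, then alternating "generator resumed" lines and word lists, which shows that read_lines's body only executes as the loop pulls values from it.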

Brionius