
I am trying to understand how to write a Hadoop MapReduce program in Python, following this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

This is mapper.py:

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
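
As a quick sanity check, the tutorial runs the mapper on its own by piping text into it on the command line, something like this (the sample words are arbitrary):

echo "foo foo quux labs foo bar quux" | python mapper.py

which emits one word<TAB>1 pair per input word.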

I don't understand the use of yield here. read_input generates one line at a time. However, main only calls read_input once, which would seem to correspond to reading only the first line of the file. How do the remaining lines get read as well?

usual me

1 Answer


Actually, main calls read_input only once, and that single call doesn't execute the function body at all. Because read_input contains a yield, calling it just creates and returns a generator object.

data = read_input(sys.stdin)
# read_input is called exactly once here. Its body has not run yet;
# data is now a generator object.
for words in data:

On every iteration of the for loop, Python implicitly calls next() on data, the generator returned by read_input. That resumes the body of read_input from wherever it last paused, runs it until the next yield, and assigns the yielded value (the list of words from one line) to words.

Basically, for words in data is shorthand for "call next(data) and assign the result to words, then execute the loop body; repeat until the generator is exhausted".
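
If it helps, here is a minimal, self-contained sketch of that mechanism (read_lines and its input list are made up for illustration; read_lines just mimics read_input):

#!/usr/bin/env python
"""Illustrative sketch: a for loop drives a generator lazily."""

def read_lines(lines):
    # Like read_input: this body does not run when read_lines() is called.
    for line in lines:
        print("generator resumed")
        yield line.split()

data = read_lines(["hello world", "foo bar"])
print("generator created")  # printed before anything inside read_lines

# The for loop below behaves roughly like:
#     while True:
#         try:
#             words = next(data)  # resume read_lines until its next yield
#         except StopIteration:
#             break
#         ...loop body...
for words in data:
    print(words)

Running it prints "generator created" first, then alternating "generator resumed" lines and word lists, which shows that read_lines's body only executes as the loop pulls values from it.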

Brionius