I am trying to understand how to write a Hadoop MapReduce program in Python using this tutorial: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
This is mapper.py:
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
I don't understand the use of yield here. read_input generates one line at a time. However, main only calls read_input once, which as far as I can tell corresponds to the first line of the file. How do the remaining lines get read as well?
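To illustrate what I mean, here is a minimal experiment I tried with read_input on a small stand-in for sys.stdin (the fake input lines are my own, not from the tutorial):

```python
def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

# a list of strings stands in for sys.stdin, since both
# yield one line per iteration
fake_stdin = ["foo bar\n", "hello\n"]

gen = read_input(fake_stdin)   # read_input is called only once here
print(next(gen))               # first line:  ['foo', 'bar']
print(next(gen))               # second line: ['hello']
```

So the single call to read_input seems to return some kind of generator object, and each next() pulls one more line out of it, which is the behavior I am trying to understand in the context of the for loop in main.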