I'm referring to this popular blog post:
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
Here, the author first demonstrates a simple Python mapper:
#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
But then he mentions:
In a real-world application however, you might want to optimize your code by using Python iterators and generators (an even better introduction in PDF).
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
My question: Why is the second mapper more efficient than the first mapper?
If I understand yield correctly, it suspends the function until the next call. So basically, whenever data is iterated over, read_input resumes and yields another item.
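For example, this is my mental model of yield (a toy example of my own, not from the blog post):

def numbers():
    for i in range(3):
        yield i           # execution pauses here until the caller asks again

gen = numbers()
print next(gen)  # 0 -- runs the body only up to the first yield
print next(gen)  # 1 -- resumes right after the yield, one item at a time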
But even in the first, simple mapper, we are doing the same thing, right? for line in sys.stdin: would basically read whatever stdin is available to the host where the mapper is running, and we operate on it line by line.
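In other words, as far as I understand, a file object like sys.stdin is itself a lazy iterator, so I'd expect the plain loop to behave the same way (a sketch of my understanding; process_line is just a placeholder I made up):

import sys

# My understanding: the file object does not slurp all of stdin into memory;
# it hands out lines one at a time, just like a generator would.
for line in sys.stdin:
    process_line(line)  # placeholder for the per-line work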
What is the benefit of using yield here, and what sort of gains can I expect over the first version? Speed? Memory?
Thanks a lot.
EDIT: I'm not sure why people think this is a duplicate. I'm not asking how the 'yield' keyword works; I'm asking for an explanation of what benefit it provides in the Hadoop Streaming mapper context.