How to get filename when running mapreduce job on EC2?

Question

I am learning Elastic MapReduce and started off with the Word Splitter example provided in the Amazon tutorial section (code shown below). The example produces a word count for all the words in all the input documents provided.

But I want to get word counts by file name, i.e. the count of a word in just one particular document. Since the Python code for word count takes its input from stdin, how do I tell which input line came from which document?

Thanks.

#!/usr/bin/python

import sys
import re

def main(argv):
  # A word is a letter followed by any number of letters or digits.
  pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
  # Iterating over sys.stdin stops at end of input, so no EOF handling is needed.
  for line in sys.stdin:
    for word in pattern.findall(line):
      # The "LongValueSum:" prefix asks the aggregate reducer to sum the values per key.
      print("LongValueSum:" + word.lower() + "\t" + "1")

if __name__ == "__main__":
  main(sys.argv)

Praveen Sripati · Accepted Answer · 2011-11-10T07:29:59.143

In the typical WordCount example, the name of the file the mapper is processing is ignored, since the job output contains the consolidated word count for all the input files and not a count per file. To get the word count at a file level, the input file name has to be used. Mappers written in Python can read the file name from the task environment with os.environ["map.input.file"]. Depending on the Hadoop version, the variable may be exposed with underscores instead of dots (map_input_file), as noted in the comments below. The list of task execution environment variables is here.
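A minimal sketch of the lookup, written so that it tolerates the different variable names. The key "map.input.file" comes from this answer and "map_input_file" from the comments; "mapreduce_map_input_file" is an assumption about newer Hadoop releases, and the helper name get_input_file is hypothetical.

```python
import os

# Candidate names for the variable that holds the current input file.
# "mapreduce_map_input_file" is an assumption, not taken from this thread.
CANDIDATE_KEYS = ("mapreduce_map_input_file", "map_input_file", "map.input.file")


def get_input_file(env=None, default="unknown"):
    """Return the value of the first candidate key that is set, else default."""
    if env is None:
        env = os.environ
    for key in CANDIDATE_KEYS:
        value = env.get(key)
        if value:
            return value
    return default
```

Passing the environment as an argument keeps the function easy to try outside a cluster, for example get_input_file({"map_input_file": "s3://bucket/input.txt"}).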

Instead of emitting just the key/value pair <Hello, 1>, the mapper should also include the name of the input file being processed. The map can emit <input.txt, <Hello, 1>>, where input.txt is the key and <Hello, 1> is the value.
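One way such a mapper might look, as a sketch only. It assumes a streaming job where everything before the first tab is the key, so each record is written as file<TAB>word<TAB>1. The function names are hypothetical, and the environment variable name depends on the Hadoop version.

```python
import os
import re

WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def current_file_name(env):
    """Return the base name of the input file, or "unknown" if it is not set."""
    # Assumption: one of these two keys is present in the task environment.
    path = env.get("map_input_file") or env.get("map.input.file") or "unknown"
    return os.path.basename(path)


def map_line(line, file_name):
    """Yield one 'file<TAB>word<TAB>1' record per word found in the line."""
    for word in WORD_PATTERN.findall(line):
        yield "%s\t%s\t1" % (file_name, word.lower())
```

In the actual mapper script, the records would be printed for every line read from sys.stdin, with file_name taken from current_file_name(os.environ).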

Now, all the word counts for a particular file will be processed by a single reducer. The reducer must then aggregate the word count for that particular file.
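The reducer side could be sketched as follows. It assumes the mapper emits file<TAB>word<TAB>count records and that the framework delivers them sorted by file name, so all records for one file arrive together. The function name reduce_records is hypothetical.

```python
from collections import Counter
from itertools import groupby


def reduce_records(lines):
    """Yield 'file<TAB>word<TAB>total' records for input grouped by file name."""
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby relies on records for the same file being adjacent.
    for file_name, group in groupby(parsed, key=lambda fields: fields[0]):
        counts = Counter()
        for _, word, count in group:
            counts[word] += int(count)
        for word in sorted(counts):
            yield "%s\t%s\t%d" % (file_name, word, counts[word])
```

In the reducer script, reduce_records(sys.stdin) would be iterated and each record printed. The per-file Counter holds every distinct word of one file in memory, which is acceptable for a sketch but worth revisiting for very large files.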

As usual, a Combiner would help to decrease the network chatter between the mapper and the reducer and also to complete the job faster.
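The effect of combining can be illustrated with local pre-aggregation inside the mapper. This is a related technique and not a Combiner class as such: counts are summed in memory before anything is emitted, so fewer records cross the network. The function name combine_counts is hypothetical.

```python
from collections import Counter
import re

WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def combine_counts(lines, file_name):
    """Sum word counts locally and return 'file<TAB>word<TAB>n' records."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_PATTERN.findall(line))
    return ["%s\t%s\t%d" % (file_name, word, n) for word, n in sorted(counts.items())]
```

For the two lines "Hello hello" and "world hello", this emits two records instead of four.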

Check Data-Intensive Text Processing with MapReduce for more algorithms on text processing.

edited Nov 10 '11 at 07:29

answered Nov 10 '11 at 07:17


Thanks for the explanation ! I went through your blog and you recommend the book "Hadoop the Defn guide" for starters. But as you mentioned I need to think in a MapReduce way. Are there any good sources for it ? Also is the book good enough for learning about Hadoop development ? – Nik Nov 10 '11 at 09:01

Check the different problems solved with MR (http://goo.gl/kECuV). Go through the MR videos (http://goo.gl/RRoVP) by Google. The book "Hadoop: The Definitive Guide" is like the Bible for Hadoop. There is also "Apress: Pro Hadoop" (http://goo.gl/VTcfa), but I don't like the style. – Praveen Sripati Nov 10 '11 at 09:51
FYI, in newer versions of Hadoop the variable is map_input_file. (The case for 2.0.2.) – Paul Feb 25 '13 at 14:10
@PraveenSripati I tried to use the help you provided but I'm getting below error. input_file = os.environ["map.input.file"] File "/usr/lib64/python2.7/UserDict.py", line 23, in __getitem__ raise KeyError(key) KeyError: 'map.input.file' – colourtheweb Nov 10 '15 at 00:37
