
I'm using Hadoop 1.0.1 on a single node and trying to stream a tab-delimited file with Python 2.7. I can get Michael Noll's word count scripts to run under Hadoop streaming, but I can't get this extremely simple mapper and reducer, which just duplicate the file, to work. Here's the mapper:

import sys

for line in sys.stdin:
    line = line.strip()
    print '%s' % line

Here's the reducer:

import sys

for line in sys.stdin:
    line = line.strip()
    print line

Here's part of the input file:

1   857774.000000
2   859164.000000
3   859350.000000
...

The mapper and reducer work fine from the Linux shell:

cat input.txt | python mapper.py | sort | python reducer.py > a.out

but after I chmod the mapper and reducer, copy the input file to HDFS, confirm that it is there, and run:

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file mapperSimple.py -mapper mapperSimple.py -file reducerSimple.py -reducer reducerSimple.py -input inputDir/* -output outputDir

I get the following error:

12/06/03 10:19:11 INFO streaming.StreamJob:  map 0%  reduce 0%
12/06/03 10:20:15 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201206030550_0003_m_000001
12/06/03 10:20:15 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Any ideas? Thanks.

user1106278

2 Answers

Do your Python files have shebang (hashbang) headers? I suspect the problem is that when Java executes the mapper file, it asks the OS to run it directly, and without a shebang line the OS doesn't know which interpreter to use. Also make sure the files are marked executable (chmod a+x mapperSimple.py). With the header, the mapper looks like this:

#!/usr/bin/python
import sys

for line in sys.stdin:
    line = line.strip()
    print '%s' % line

Try this from the command line to ensure the shell knows to execute the files with the python interpreter:

cat input.txt | ./mapper.py | sort | ./reducer.py > a.out
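Both conditions above (a shebang line and the executable bit) can also be checked from Python before submitting the job. This is only a minimal sketch: the function name streaming_script_problems is made up for illustration and is not part of Hadoop or of the scripts in the question.

```python
import os


def streaming_script_problems(path):
    """Return a list of reasons the OS might refuse to run `path` directly.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    problems = []
    # A script that is executed directly must start with "#!" so the OS
    # can tell which interpreter to hand the file to.
    with open(path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        problems.append("missing shebang line")
    # os.access reports whether the current user may execute the file.
    if not os.access(path, os.X_OK):
        problems.append("not executable (try chmod a+x)")
    return problems
```

Calling streaming_script_problems("mapperSimple.py") should return an empty list once the header is in place and chmod a+x has been applied.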
Chris White
  • Thanks, Chris. My brain always ignores lines with the comment symbol, so I ended up spending 20 hours cutting back my code and trying every variation I could think of. Much appreciated. – user1106278 Jun 03 '12 at 15:51

In addition to Chris White's answer, the shebang header should be:

#!/usr/bin/env python

which runs whichever python comes first on your PATH (Python 2.7 by default on many systems). If you want Python 3, use:

#!/usr/bin/env python3

And DO NOT use:

#!/usr/bin/python

Because it fails on any machine where Python is not installed at exactly that path, including mine *sigh*

Check this Answer for more info
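To see the difference between the two headers on a given machine, a rough sketch like the one below compares the hard-coded path with a PATH lookup, which is approximately what env does. The function name describe_interpreter is invented for illustration, and the sketch assumes Python 3.3 or later (for shutil.which), unlike the Python 2.7 scripts in the question.

```python
import os
import shutil


def describe_interpreter(name="python", fixed_path="/usr/bin/python"):
    """Compare a hard-coded interpreter path with a PATH lookup.

    Hypothetical helper for illustration only.
    """
    return {
        # True only if something exists at exactly this path.
        "fixed_path_exists": os.path.exists(fixed_path),
        # First executable called `name` on PATH, or None if there is none;
        # this approximates what "#!/usr/bin/env python" would resolve to.
        "found_on_path": shutil.which(name),
    }
```

If fixed_path_exists is False while found_on_path is set, the #!/usr/bin/python header would fail on that machine but the env form would still find an interpreter.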

Anwarvic