22

Following this guide, I have successfully run the sample exercise. But when I run my own MapReduce job, I get the following error:
ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Error from the log file:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Mapper.py

import sys

# The running line number doubles as the document (tweet) id.
i = 0

for line in sys.stdin:
    i += 1
    # Count how often each word occurs on this line.
    count = {}
    for word in line.strip().split():
        count[word] = count.get(word, 0) + 1
    # Emit one record per distinct word: word <tab> line_id:count
    for word, weight in count.items():
        print '%s\t%s:%s' % (word, str(i), str(weight))
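
For example, given the first input line "to be or not to be", the mapper emits one record per distinct word in the form word <tab> line_id:count (dictionary ordering is arbitrary):

to      1:2
be      1:2
or      1:1
not     1:1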

Reducer.py

import sys

# Sentinel that matches no real word, so the first record takes the else branch.
o_tweet = "2323"
id_list = []
for line in sys.stdin:
    tweet, tw = line.strip().split()
    tweet_id, w = tw.split(':')
    w = int(w)
    if tweet == o_tweet:
        # Same word as the previous record: pair this line id with
        # every line id seen so far for this word.
        for i, wt in id_list:
            print '%s:%s\t%s' % (tweet_id, i, str(w + wt))
        id_list.append((tweet_id, w))
    else:
        # New word: start a fresh list of (line id, weight) pairs.
        id_list = [(tweet_id, w)]
        o_tweet = tweet
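
Since the shuffle sorts the mapper output by word, the reducer sees all records for a word together; for every pair of lines that share a word, it emits the two line ids and the sum of their weights. For example, the records "be <tab> 1:2" and "be <tab> 2:2" produce:

2:1    4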

[edit] Command to run the job:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output

Input is any random sequence of sentences.

Thanks,

db42

7 Answers

21

Your -mapper and -reducer should just be the script name.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output

When you ship your scripts with -file, Hadoop copies them into the job's working folder within HDFS, and the task attempt executes with that folder as ".". (FYI, if you ever want to add another -file, such as a lookup table, your Python code can open it as if it were in the same directory as your scripts while your script runs in the M/R job; see the sketch below.)
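
A minimal sketch of that pattern (lookup.txt is a hypothetical side file shipped with an extra -file /home/hadoop/lookup.txt option; the mapper body is illustrative):

#!/usr/bin/env python
import sys

# lookup.txt was shipped with "-file"; Hadoop places it in the task's
# working directory, so a plain relative open works.
lookup = {}
for row in open('lookup.txt'):
    key, value = row.strip().split('\t')
    lookup[key] = value

for line in sys.stdin:
    word = line.strip()
    print '%s\t%s' % (word, lookup.get(word, 'UNKNOWN'))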

Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py.

Joe Stein
  • Try adding #!/usr/bin/env python to the top of your Python script; it should then be able to execute from the command line just by doing cat data.file | ./mapper.py | sort | ./reducer.py, and it won't without the "#!/usr/bin/env python" at the top of the file. – Joe Stein Dec 17 '10 at 07:26
  • This solution wasn't working for me until I added quotation marks around the mapper and reducer arguments (i.e. -mapper "mapper.py"). I got the idea from the Apache documentation: http://wiki.apache.org/hadoop/HadoopStreaming – sph21 Mar 07 '14 at 17:50
16

Try adding

 #!/usr/bin/env python

to the top of your script.
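
A minimal sketch of what the top of the script should look like (the body below is illustrative, not your actual mapper):

#!/usr/bin/env python
# Without this shebang the task cannot execute the file directly,
# and the streaming subprocess fails as in the error above.
import sys

for line in sys.stdin:
    print line.strip()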

Or,

-mapper 'python m.py' -reducer 'python r.py'
Marvin W
  • Great! Adding "#!/usr/bin/env python" to the header solved my issue. – Andy Dong Jul 24 '16 at 20:55
  • Damn, copy-pasting code via vim removed the `#!` and caused this issue for me. Thanks for the reminder to add the header!! – asgs Jun 05 '17 at 19:31
3

You need to explicitly instruct Hadoop that the mapper and reducer are run as Python scripts, since there are several options for streaming. You can use either single quotes or double quotes:

-mapper "python mapper.py" -reducer "python reducer.py" 

or

-mapper 'python mapper.py' -reducer 'python reducer.py'

The full command goes like this:

hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py
Gopal Kumar
2

I ran into this error recently, and my problem turned out to be something just as obvious (in hindsight) as the other solutions here:

I simply had a bug in my Python code. (In my case, I was using Python 2.7 string formatting, whereas the AWS EMR cluster I was on was running Python 2.6.)
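
As a hypothetical illustration (not my exact bug): auto-numbered format fields were only added in Python 2.7, so the first line below runs fine locally on 2.7 but crashes inside the 2.6 cluster's mappers:

# Raises "ValueError: zero length field name in format" on Python 2.6:
print '{}\t{}'.format('word', 3)
# Portable across 2.6 and 2.7 (explicitly numbered fields):
print '{0}\t{1}'.format('word', 3)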

To find the actual Python error, go to the Job Tracker web UI (in the case of AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x), find the failed mapper, open its logs, and read the stderr output.

Dolan Antenucci
0

Make sure your input directory only contains the correct files.

  • I am having this issue. I do not have any issue running the scripts outside of Hadoop, but I tried the suggestions here and do not see anything in the logs. Is there any other way to debug this issue? – E B Feb 07 '17 at 07:43
0

I had the same problem too. I tried Marvin W's solution, and I also installed Spark. Make sure that you have installed Spark itself, not just pyspark (the dependency), but also the framework, and follow that installation tutorial.

yunus
0

If you run this command on a Hadoop cluster, make sure that Python is installed on every NodeManager instance.

smbanaei