I want to read a list of IDs from a file in my Hadoop streaming job. Here is my simple mapper.py:
#!/usr/bin/env python
import sys
import json

def read_file():
    id_list = []
    # read ids from a file
    f = open('../user_ids', 'r')
    for line in f:
        line = line.strip()
        id_list.append(line)
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)
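From what I have read, files shipped with -file are placed in each task's current working directory under their base name, so perhaps the task should be opening user_ids rather than ../user_ids. Here is a sketch of read_file written around that assumption (the default path and the stderr diagnostic are just my guesses for debugging; the message should show up in the failed task's logs):

#!/usr/bin/env python
# sketch: assume -file ships '../user_ids' into the task's working
# directory as plain 'user_ids', so open it by its base name; if it
# is missing, report on stderr where we are and what we can see
import os
import sys

def read_file(path='user_ids'):
    if not os.path.exists(path):
        sys.stderr.write('missing %s; cwd=%s contains %s\n'
                         % (path, os.getcwd(), os.listdir('.')))
        sys.exit(2)
    f = open(path, 'r')
    id_list = [line.strip() for line in f]
    f.close()
    return id_list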
And here is my reducer.py:
#!/usr/bin/env python
from operator import itemgetter
import sys

current_id = None
current_list = []
id = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    id, line = line.split('\t', 1)
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: id) before it is passed to the reducer
    if current_id == id:
        current_list.append(line)
    else:
        if current_id:
            # write result to STDOUT
            print '%s\t%s' % (current_id, current_list)
        current_id = id
        current_list = [line]

# do not forget to output the last id if needed!
if current_id == id:
    print '%s\t%s' % (current_id, current_list)
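To take Hadoop itself out of the picture, I can also test the two scripts locally with a pipe (assuming test/input.txt is a local copy of the input):

cat test/input.txt | ./mapper.py | sort | ./reducer.py

This exercises the mapper/reducer logic but not the -file distribution, so it would not reproduce a missing-file failure on the cluster.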
Now to run it I say:
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'
The job starts to run:
13/11/07 05:04:52 INFO streaming.StreamJob: map 0% reduce 0%
13/11/07 05:05:21 INFO streaming.StreamJob: map 100% reduce 100%
13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:
I get the error:
job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201309172143_1390_m_000001
13/11/07 05:05:21 INFO streaming.StreamJob: killJob...
When I do not read the ids from the file ../user_ids, it does not give me any errors, so I think the problem is that it cannot find my ../user_ids file. I have also tried the file's HDFS location and it still did not work.
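One more thing I am considering: since the list could also live on HDFS, I believe streaming in this version accepts -cacheFile, which symlinks an HDFS file into the task's working directory under the name given after the # (the namenode host, port, and path below are placeholders):

-cacheFile hdfs://namenode:9000/user/me/user_ids#user_ids

in which case the mapper would simply do open('user_ids'). Thanks for your help.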