I want to read a list of IDs from a file in my Hadoop streaming job. Here is my simple mapper.py:
#!/usr/bin/env python
import sys
import json

def read_file():
    id_list = []
    # read ids from a file
    f = open('../user_ids', 'r')
    for line in f:
        line = line.strip()
        id_list.append(line)
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)
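From what I have read, files shipped with -file are placed in each task's current working directory under their base name, so perhaps the task should be opening user_ids rather than ../user_ids. Here is a sketch of read_file written around that assumption (the default path and the stderr diagnostic are just my guesses for debugging; the message should show up in the failed task's logs):

#!/usr/bin/env python
# sketch: assume -file ships '../user_ids' into the task's working
# directory as plain 'user_ids', so open it by its base name; if it
# is missing, report on stderr where we are and what we can see
import os
import sys

def read_file(path='user_ids'):
    if not os.path.exists(path):
        sys.stderr.write('missing %s; cwd=%s contains %s\n'
                         % (path, os.getcwd(), os.listdir('.')))
        sys.exit(2)
    f = open(path, 'r')
    id_list = [line.strip() for line in f]
    f.close()
    return id_list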
And here is my reducer.py:
#!/usr/bin/env python
from operator import itemgetter
import sys

current_id = None
current_list = []
id = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    id, line = line.split('\t', 1)
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: id) before it is passed to the reducer
    if current_id == id:
        current_list.append(line)
    else:
        if current_id:
            # write result to STDOUT
            print '%s\t%s' % (current_id, current_list)
        current_id = id
        current_list = [line]

# do not forget to output the last id if needed!
if current_id == id:
    print '%s\t%s' % (current_id, current_list)
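To take Hadoop itself out of the picture, I can also test the two scripts locally with a pipe (assuming test/input.txt is a local copy of the input):

cat test/input.txt | ./mapper.py | sort | ./reducer.py

This exercises the mapper/reducer logic but not the -file distribution, so it would not reproduce a missing-file failure on the cluster.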
Now to run it I say:
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'
The job starts to run:
13/11/07 05:04:52 INFO streaming.StreamJob: map 0% reduce 0%
13/11/07 05:05:21 INFO streaming.StreamJob: map 100% reduce 100%
13/11/07 05:05:21 INFO streaming.StreamJob: To kill this job, run:
I get the error:
job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201309172143_1390_m_000001
13/11/07 05:05:21 INFO streaming.StreamJob: killJob...
When I do not read the ids from the file ../user_ids, it does not give me any errors, so I think the problem is that it cannot find my ../user_ids file. I have also tried the file's HDFS location and it still did not work.
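One more thing I am considering: since the list could also live on HDFS, I believe streaming in this version accepts -cacheFile, which symlinks an HDFS file into the task's working directory under the name given after the # (the namenode host, port, and path below are placeholders):

-cacheFile hdfs://namenode:9000/user/me/user_ids#user_ids

in which case the mapper would simply do open('user_ids'). Thanks for your help.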