
I understand that I can pass global values to my mappers via the Job and the Configuration.

But how can I do that using Hadoop Streaming (Python in my case)?

What is the right way?

member555

1 Answer


Based on the docs, you can pass the command-line option -cmdenv name=value to set an environment variable on each distributed machine, which you can then read in your mappers/reducers:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input input.txt \
    -output output.txt \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py \
    -cmdenv MY_PARAM=thing_I_need
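
For example, a minimal mapper sketch (assuming the MY_PARAM variable set above) can read the value from its environment with os.environ:

    #!/usr/bin/env python
    # mapper.py - minimal sketch; assumes the job was launched with
    # -cmdenv MY_PARAM=thing_I_need as in the command above
    import os
    import sys

    # read the value passed via -cmdenv, with a fallback default
    my_param = os.environ.get('MY_PARAM', 'some_default')

    for line in sys.stdin:
        line = line.strip()
        # emit the parameter alongside each input line (tab-separated)
        print("%s\t%s" % (my_param, line))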
carpenter
  • Is that the only way? It looks 'ugly'. Anyway, how do I access that variable in my Python code? – member555 Aug 10 '15 at 19:25
  • Hadoop manages all of the distribution for you, so this simply sets up environment variables on each machine it runs on. You could also make a call over the network to some internal and static location, but this is a no-no because it is expensive. Check out [this question](http://stackoverflow.com/questions/4906977/how-to-access-environment-variables-from-python) regarding accessing environment variables from Python. – carpenter Aug 10 '15 at 19:51