
Is it possible to configure hadoop streaming to read two or more input arguments at runtime for a job?

For example, let's say I have a script which is executed as: my_script file1 file2

How can I specify this in hadoop streaming?

As far as I know, I can only specify jobs which have the following execution syntax: my_script "fixed_params" "input".


1 Answer


Haven't worked in streaming much, but I'm pretty sure you can just add another -input argument.
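For example, an invocation along these lines should work (the jar path depends on your Hadoop install, and the HDFS paths and script name are placeholders matching the question):

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/dev/file1 \
    -input /user/dev/file2 \
    -output /user/dev/output \
    -mapper my_script \
    -file my_script
```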

Also see: Using multiple mapper inputs in one streaming job on hadoop?

  • I don't think so (I have already tried it, and it failed)... My understanding is that -input is only used to specify the input argument. Specifying multiple -input args means the streaming job will consider those multiple input directories and/or files for the mappers, but not in the syntax I specified in my question. The whole idea of streaming, I think, is based on piping the data to the mappers and reducers, and I am not sure how Hadoop handles piping multiple args to a script. – Dev Sep 10 '12 at 18:47
  • I'm not sure what you're asking then. Could you clarify what exactly you're trying to accomplish? – HypnoticSheep Sep 10 '12 at 19:20
  • Say I have a script which looks like this: `arg1=$1; arg2=$2; do_something $arg1 $arg2`. Now, how would I run this script using hadoop-streaming? – Dev Sep 10 '12 at 20:09
  • In that case, you may want to look into http://stackoverflow.com/questions/9509063/how-do-i-pass-a-parameter-to-a-python-hadoop-streaming-job?rq=1 – HypnoticSheep Sep 10 '12 at 21:04
  • I'd try that, but I feel it would work only if the parameters are constant or reside in the local file system. In my case, arg1 and arg2 are names of files in HDFS, and I have to somehow specify the inputs through the -input arg and make Hadoop Streaming consider both args for a single execution of the script. – Dev Sep 11 '12 at 16:06
  • Okay, so you're trying to use multiple input files from HDFS in a streaming job? By default, Hadoop processes all input together; it doesn't split separate -input statements into separate jobs. Just use multiple `-input` statements (see the sketch below the comments). You can refer to http://hadoop.apache.org/docs/mapreduce/r0.22.0/streaming.html#How+do+I+specify+multiple+input+directories%3F as well. – HypnoticSheep Sep 11 '12 at 16:35
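To make the piping model from the comments concrete, here is a minimal sketch of what the mapper would look like, assuming `do_something` is the placeholder command from the comment above. Hadoop Streaming concatenates the records from every `-input` path and feeds them to the mapper on stdin, so the script reads lines rather than receiving the HDFS file names as `$1` and `$2`:

```bash
#!/bin/bash
# Minimal streaming-mapper sketch: records from every -input path arrive on
# stdin, one per line; the HDFS file names are never passed as $1/$2.
while IFS= read -r line; do
    do_something "$line"   # do_something is the placeholder from the comment
done
```

If the script needs to treat the two inputs differently, Hadoop Streaming also exposes (if I remember correctly) the originating file name to the mapper through the `map_input_file` environment variable, which the script can inspect per record.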