In Java I would use:
MultipleInputs.addInputPath(conf, path, inputFormatClass, mapperClass)
to add multiple inputs with a different mapper for each.
Now I am writing a Hadoop Streaming job in Python; can something similar be done?
You can use multiple -input options to specify multiple input paths:
hadoop jar hadoop-streaming.jar -input foo.txt -input bar.txt ...
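Note that plain Hadoop Streaming runs the same mapper over all inputs, but that mapper can branch on where the current record came from: streaming exports the job configuration as environment variables (with dots replaced by underscores), so the current split's file is available as map_input_file on older releases and mapreduce_map_input_file on newer ones. Here is a minimal sketch of a dispatching mapper; the handle_foo/handle_bar parsing functions are hypothetical placeholders for your per-input logic:

#!/usr/bin/env python
import os
import sys

def handle_foo(line):
    # hypothetical logic for records coming from foo.txt
    print('foo\t%s' % line.strip())

def handle_bar(line):
    # hypothetical logic for records coming from bar.txt
    print('bar\t%s' % line.strip())

# Hadoop Streaming exposes job config as env vars ('.' -> '_');
# older versions set map_input_file, newer ones mapreduce_map_input_file.
input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))

for line in sys.stdin:
    if 'foo' in input_file:
        handle_foo(line)
    else:
        handle_bar(line)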
This project may help: https://github.com/hyonaldo/hadoop-multiple-streaming. It also supports different mappers (and reducers) for different paths within a single streaming job:
hadoop jar hadoop-multiple-streaming.jar \
  -input myInputDirs \
  -multiple "outputDir1|mypackage.Mapper1|mypackage.Reducer1" \
  -multiple "outputDir2|mapper2.sh|reducer2.sh" \
  -multiple "outputDir3|mapper3.py|reducer3.py" \
  -multiple "outputDir4|/bin/cat|/bin/wc" \
  -libjars "libDir/mypackage.jar" \
  -file "libDir/mapper2.sh" \
  -file "libDir/mapper3.py" \
  -file "libDir/reducer2.sh" \
  -file "libDir/reducer3.py"