In Java I would use:
MultipleInputs.addInputPath(conf, path, inputFormatClass, mapperClass)
to add multiple inputs with a different mapper for each.
Now I am writing a Hadoop Streaming job in Python; can something similar be done?
You can use multiple -input options to specify multiple input paths:
hadoop jar hadoop-streaming.jar -input foo.txt -input bar.txt ...
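Note that plain Hadoop Streaming runs the same mapper over all inputs, but that mapper can branch on where the current record came from: streaming exports the job configuration as environment variables (with dots replaced by underscores), so the current split's file is available as map_input_file on older releases and mapreduce_map_input_file on newer ones. Here is a minimal sketch of a dispatching mapper; the handle_foo/handle_bar parsing functions are hypothetical placeholders for your per-input logic:

#!/usr/bin/env python
import os
import sys

def handle_foo(line):
    # hypothetical logic for records coming from foo.txt
    print('foo\t%s' % line.strip())

def handle_bar(line):
    # hypothetical logic for records coming from bar.txt
    print('bar\t%s' % line.strip())

# Hadoop Streaming exposes job config as env vars ('.' -> '_');
# older versions set map_input_file, newer ones mapreduce_map_input_file.
input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))

for line in sys.stdin:
    if 'foo' in input_file:
        handle_foo(line)
    else:
        handle_bar(line)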
This project may help: https://github.com/hyonaldo/hadoop-multiple-streaming. It also supports different mappers (and reducers) for different paths within a single streaming job:
hadoop jar hadoop-multiple-streaming.jar \
  -input myInputDirs \
  -multiple "outputDir1|mypackage.Mapper1|mypackage.Reducer1" \
  -multiple "outputDir2|mapper2.sh|reducer2.sh" \
  -multiple "outputDir3|mapper3.py|reducer3.py" \
  -multiple "outputDir4|/bin/cat|/bin/wc" \
  -libjars "libDir/mypackage.jar" \
  -file "libDir/mapper2.sh" \
  -file "libDir/mapper3.py" \
  -file "libDir/reducer2.sh" \
  -file "libDir/reducer3.py"