
I am executing the job as:

hadoop/bin/hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=2 -file kmeans_mapper.py -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out

When the two reducers are done, I would like to do something with the results, so ideally I would like to invoke another script (another mapper?) that would receive the output of the reducers as its input. How can I do that easily?

I checked this blog, which has an mrjob example, but it doesn't explain enough; I don't see how to do the same for my job.

The MapReduce tutorial states:

Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.

but it doesn't give any example...
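To make it concrete, here is a sketch of what I imagine the second step would look like: a second streaming job whose -input is the -output directory of the first one. (postprocess_mapper.py and gutenberg/out2 are made-up names, and I made the second job map-only with mapred.reduce.tasks=0.)

# hypothetical second job: the first job's output directory becomes the input
hadoop/bin/hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=0 \
    -file postprocess_mapper.py -mapper postprocess_mapper.py \
    -input gutenberg/out -output gutenberg/out2

Is that really all there is to it, and can the two steps be launched with a single command?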

Here is some Java code that I could understand, but I am writing Python! :/


This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming

gsamaras
  • You specified the output to an hdfs directory, right? You'd need to do another mapreduce job with that as your input – OneCricketeer Feb 07 '16 at 04:44
  • Yes @cricket_007, I am not sure how to do that with one call. I mean I could execute the job just like my answer and then execute another job, which would invoke only a mapper. But that seems bizarre, can't I do it by pressing the **ENTER** key once? :) – gsamaras Feb 07 '16 at 04:46
  • I'm pretty sure the tutorial you are reading (the page you linked to is over 2 years old) is saying the Java MapReduce API allows stringing jobs together. The streaming MapReduce only goes through standard in and standard out. You may be able to pipe the commands through each other, but only if you output to standard out – OneCricketeer Feb 07 '16 at 04:50
  • @cricket_007 yes, I am using `sys.stdin` for input and `print` for output. – gsamaras Feb 07 '16 at 04:51
  • Right, in your python code because that is how the streaming api works, but it gets bundled in a jar file, and sent to hdfs. You won't be able to get the standard output of the job back to your local terminal to pipe it to a new command, that is why the output is written to HDFS – OneCricketeer Feb 07 '16 at 04:54
  • Yes I agree @cricket_007, but what does that mean for my question? No luck? :/ But I am getting the input for the mapper from HDFS... – gsamaras Feb 07 '16 at 04:57
  • Using the streaming api, afraid so. Unfortunately, I can't remember the Java Api well enough to give an example for chaining jobs. – OneCricketeer Feb 07 '16 at 04:59
  • Maybe the update I did will help. Thanks @cricket_007 for your time! :) – gsamaras Feb 07 '16 at 05:01
  • I simply don't think what you're asking for is possible with the streaming api – OneCricketeer Feb 07 '16 at 05:02
  • That would be an answer @cricket_007, so I could stop looking! – gsamaras Feb 07 '16 at 05:03
  • Alright. I'll try to explain that in an answer, then. – OneCricketeer Feb 07 '16 at 05:04
  • @cricket_007 you may want to check my last edit before posting an answer. – gsamaras Feb 07 '16 at 05:07
  • Good luck getting an answer from that person, they haven't been active in 4 years – OneCricketeer Feb 07 '16 at 05:12

1 Answer


It is possible to do what you're asking for with the Java API, as in the example you found.

But you are using the streaming API, which simply reads standard in and writes to standard out. There is no callback to say when a MapReduce job has completed other than the completion of the hadoop jar command, and the fact that it completed doesn't really indicate success. That being said, it really isn't possible without some more tooling around the streaming API.

If the output were written to the local terminal rather than to HDFS, it might be possible to pipe that output into the input of another streaming job, but unfortunately the inputs and outputs of the streaming jar have to be paths on HDFS.
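If a single command is the goal, the closest thing to "more tooling" is a small driver script that runs the hadoop jar commands back to back and points the second job's -input at the first job's -output directory on HDFS. A sketch, assuming hadoop is on the PATH and reusing your jar path; postprocess_mapper.py and gutenberg/out2 are made-up names:

#!/bin/bash
set -e  # abort the chain if any command exits with a non-zero status

JAR=/home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar

# job 1: the k-means job from the question, output goes to HDFS
hadoop jar "$JAR" \
    -D mapred.reduce.tasks=2 \
    -file kmeans_mapper.py -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out

# a job that finishes normally leaves a _SUCCESS marker in its output directory;
# hadoop fs -test -e returns non-zero (and set -e stops the script) if it is missing
hadoop fs -test -e gutenberg/out/_SUCCESS

# job 2 (hypothetical): a map-only pass over the reducers' output
hadoop jar "$JAR" \
    -D mapred.reduce.tasks=0 \
    -file postprocess_mapper.py -mapper postprocess_mapper.py \
    -input gutenberg/out -output gutenberg/out2

That still launches two separate jobs; it just hides them behind one ENTER, and the intermediate data always goes through HDFS.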

OneCricketeer