I am executing the job as:
    hadoop/bin/hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
        -D mapred.reduce.tasks=2 \
        -file kmeans_mapper.py -mapper kmeans_mapper.py \
        -file kmeans_reducer.py -reducer kmeans_reducer.py \
        -input gutenberg/small_train.csv -output gutenberg/out
When the two reducers are done, I would like to do something with their results: ideally I would call another script (another mapper?) that receives the reducers' output as its input. How can I do that easily? Something like the sketch below is what I have in mind.
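A rough sketch of what I mean (the file name is made up, and I am assuming the reducers emit tab-separated key/value lines, as streaming jobs usually do):

    #!/usr/bin/env python
    # hypothetical second-stage mapper: Hadoop Streaming would feed it the
    # first job's reducer output on stdin, one "key<TAB>value" line at a time
    import sys

    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        # ... do something with the first job's results here ...
        print("%s\t%s" % (key, value))  # pass them on, tab-separated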
I checked this blog, which has an mrjob example, but it doesn't really explain the chaining, and I can't figure out how to do it for my case.
The MapReduce tutorial states:
Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.
but it doesn't give any example...
Here is some code in Java that I could follow, but I am writing Python! :/
This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming
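Putting these hints together, I guess I could drive the two streaming jobs from a small Python script, where the second job's -input is simply the first job's -output directory. A rough sketch of what I have in mind (postprocess_mapper.py and postprocess_reducer.py are placeholder names for the second step):

    import subprocess

    # paths copied from my current command
    HADOOP = "hadoop/bin/hadoop"
    STREAMING_JAR = "/home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar"

    def run_streaming_job(mapper, reducer, input_path, output_path, reduce_tasks=2):
        cmd = [
            HADOOP, "jar", STREAMING_JAR,
            "-D", "mapred.reduce.tasks=%d" % reduce_tasks,
            "-file", mapper, "-mapper", mapper,
            "-file", reducer, "-reducer", reducer,
            "-input", input_path,
            "-output", output_path,
        ]
        # raises CalledProcessError if a job fails, so the chain stops there
        subprocess.check_call(cmd)

    # job 1: the existing k-means step
    run_streaming_job("kmeans_mapper.py", "kmeans_reducer.py",
                      "gutenberg/small_train.csv", "gutenberg/out")

    # job 2: reads job 1's output directory as its input
    run_streaming_job("postprocess_mapper.py", "postprocess_reducer.py",
                      "gutenberg/out", "gutenberg/out2")

Is that the right approach, or is there an easier built-in way to do this with Hadoop Streaming?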