2

In my Hadoop code, I have 4 reducers, so I always get 4 output files, which is normal since each reducer writes its result to its own file. My question: how can I get one and only one output file?

The problem is that I have an iterative MapReduce job that takes an input file, divides it into chunks, and gives each chunk to a mapper. That is why I have to gather all the reducers' results into one output file, so that I can split this output file evenly into 4 parts again, give each part to one mapper, and so on.

Hadoop User
  • I might be wrong, but I think you can't do that. Why not have only one reducer? – Chiron Mar 10 '14 at 16:52
  • Having one reducer is not good for my application, because I want to benefit from the cluster and its resources (mappers and reducers)! So is it impossible to do that? I have been searching for a long time, but I have no idea how to solve it without an extra job that aggregates all the output files!! – Hadoop User Mar 10 '14 at 18:21
  • Maybe there is a way to call `hadoop dfs -getmerge` from your source code to get the output as one file locally, and then `hadoop dfs -copyFromLocal` to put it back on the cluster? Another solution could perhaps be to skip the division into chunks after the first iteration – vefthym Mar 10 '14 at 19:46
  • Thanks for your suggestion, but why locally? As far as I know, DFS works globally across all the machines in the cluster, right? About your second suggestion: what do you mean by skipping the division into chunks? If I don't divide the data, can it still be processed by more than one mapper? Thanks – Hadoop User Mar 10 '14 at 23:47
  • I have answered a similar question; you may want to check it: [http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase/33360716#33360716](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase/33360716#33360716) – Aniruddha Sinha Oct 27 '15 at 11:17
  • @HadoopUser: have you found a solution to this problem? I am facing the same issue right now. – Daisy Dec 08 '15 at 09:38
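
Building on vefthym's `-getmerge` suggestion above, the merge can also be done entirely on HDFS from driver code, without the local round trip. The following is only a rough sketch, assuming a Hadoop 2.x cluster where `FileUtil.copyMerge` is still available (it was removed in Hadoop 3.x); the directory and file paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeIterationOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // directory holding part-r-00000 ... part-r-00003 from the previous iteration (placeholder)
            Path reducerOutputDir = new Path("iteration-output");
            // single merged file that the next iteration will split into chunks again (placeholder)
            Path mergedFile = new Path("iteration-merged/part-all");

            // concatenate every file in the source directory into one destination file on HDFS
            FileUtil.copyMerge(fs, reducerOutputDir, fs, mergedFile,
                    false,  // keep the source directory
                    conf,
                    null);  // no separator string between the merged parts
        }
    }

Run between iterations, this would turn the four part-r-* files into a single file that the next iteration can split into chunks again.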

2 Answers

0

You could try MultipleOutputs, which lets you specify the output file each reducer should write to. For example, in your reducer code:

    ...
    private MultipleOutputs<YourKey, YourValue> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<YourKey, YourValue>(context);
    }

    @Override
    public void reduce(YourKey key, Iterable<YourValue> values, Context context)
            throws IOException, InterruptedException {
        // .......
        // instead of writing through the context, write through MultipleOutputs
        // context.write(key, your-result);
        out.write(key, your-result, "path/filename");
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
    ...

For this to work, you also need some job configuration:

    ......
    job.setOutputFormatClass(NullOutputFormat.class);
    // wrap a concrete output format (FileOutputFormat itself is abstract)
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("output"));
    ......

In this case, each reducer's output will be written to `output/path/filename`.

Tom Sebastian
  • Thanks for this idea, but what about multiple reducers appending to the file `filename` at the same time? Will that cause a problem or not? I guess it is a matter of synchronization, isn't it? – Hadoop User Mar 11 '14 at 10:21
-1

You can configure the number of reducers you want. While defining your job, use this:

job.setNumReduceTasks(1)
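
A rough driver sketch (not from the original answer) showing where that call fits, using Hadoop's identity Mapper and Reducer so the example stays self-contained; the input and output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SingleReducerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "single-reducer-job");
            job.setJarByClass(SingleReducerDriver.class);

            // identity mapper and reducer just to keep the sketch self-contained;
            // a real job would plug in its own classes here
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // one reduce task => exactly one output file (part-r-00000)
            job.setNumReduceTasks(1);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The trade-off, as the comments above point out, is that all reduce work then runs in a single task.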

sunil