2

In my Hadoop code, I have 4 reducers, so I always get 4 output files, which is normal since each reducer writes its result to its own file. My question: how can I get one and only one output file?

The problem is that I have an iterative MapReduce job that takes an input file, divides it into chunks, and gives each chunk to a mapper. That is why I have to gather all the reducers' results into one output file, so that I can split this output file evenly into 4 parts again, give each part to one mapper, and so on.

Hadoop User
  • I might be wrong, but I think you can't do that. Why not have only one reducer? – Chiron Mar 10 '14 at 16:52
  • Having one reducer is not good for my application, because I want to benefit from the cluster and its resources (mappers and reducers)! So is it impossible to do that? I have been searching for a long time, but I have no idea how to solve it without an extra job that aggregates all the output files!! – Hadoop User Mar 10 '14 at 18:21
  • Maybe there is a way to call `hadoop dfs -getmerge` from your source code to get the output as one file locally, and then `hadoop dfs -copyFromLocal` to put it back on the cluster? Another solution could perhaps be to skip the division into chunks after the first iteration – vefthym Mar 10 '14 at 19:46
  • Thanks for your suggestion, but why locally? As far as I know, DFS works globally across all the machines in the cluster, right? About your second suggestion: what do you mean by skipping the division into chunks? If I don't divide the data, can it still be processed by more than one mapper? Thanks – Hadoop User Mar 10 '14 at 23:47
  • I have answered a similar question; you may want to check it: [http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase/33360716#33360716](http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase/33360716#33360716) – Aniruddha Sinha Oct 27 '15 at 11:17
  • @HadoopUser: have you found a solution to this problem? I am facing the same issue right now. – Daisy Dec 08 '15 at 09:38
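
Building on vefthym's `-getmerge` suggestion above, the merge can also be done entirely on HDFS from driver code, without the local round trip. The following is only a rough sketch, assuming a Hadoop 2.x cluster where `FileUtil.copyMerge` is still available (it was removed in Hadoop 3.x); the directory and file paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeIterationOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // directory holding part-r-00000 ... part-r-00003 from the previous iteration (placeholder)
            Path reducerOutputDir = new Path("iteration-output");
            // single merged file that the next iteration will split into chunks again (placeholder)
            Path mergedFile = new Path("iteration-merged/part-all");

            // concatenate every file in the source directory into one destination file on HDFS
            FileUtil.copyMerge(fs, reducerOutputDir, fs, mergedFile,
                    false,  // keep the source directory
                    conf,
                    null);  // no separator string between the merged parts
        }
    }

Run between iterations, this would turn the four part-r-* files into a single file that the next iteration can split into chunks again.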

2 Answers

0

You could try MultipleOutputs, which lets you specify the output file each reducer should write to. For example, in your reducer code:

    ...
    private MultipleOutputs<YourKey, YourValue> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<YourKey, YourValue>(context);
    }

    @Override
    public void reduce(YourKey key, Iterable<YourValue> values, Context context)
            throws IOException, InterruptedException {
        // .......
        // instead of writing through the context, write through MultipleOutputs
        // context.write(key, your-result);
        out.write(key, your-result, "path/filename");
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
    ...

For this to work, you also need some job configuration:

    ......
    job.setOutputFormatClass(NullOutputFormat.class);
    // wrap a concrete output format (FileOutputFormat itself is abstract)
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("output"));
    ......

In this case, each reducer's output will be written to `output/path/filename`.

Tom Sebastian
  • Thanks for this idea, but what about multiple reducers appending to the file `filename` at the same time? Will that cause a problem or not? I guess it is a matter of synchronization, isn't it? – Hadoop User Mar 11 '14 at 10:21
-1

You can configure the number of reducers you want. While defining your job, use this:

job.setNumReduceTasks(1)
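
A rough driver sketch (not from the original answer) showing where that call fits, using Hadoop's identity Mapper and Reducer so the example stays self-contained; the input and output paths come from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SingleReducerDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "single-reducer-job");
            job.setJarByClass(SingleReducerDriver.class);

            // identity mapper and reducer just to keep the sketch self-contained;
            // a real job would plug in its own classes here
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // one reduce task => exactly one output file (part-r-00000)
            job.setNumReduceTasks(1);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The trade-off, as the comments above point out, is that all reduce work then runs in a single task.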

sunil