Hadoop jobs using same reducer output to same file

Question

I ran into an interesting situation and am now looking for a way to reproduce it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both jobs use the same reducer and differ only in the map function (the aggregation key, i.e. the group-by). The output of both jobs was written to the output of the first job; the second job did create its own folder, but it was empty. I am working on providing rollup aggregations across various levels, and I find this behavior fascinating: the aggregation output from two different levels is available to me in one single file, perfectly sorted.

My question is how to achieve the same on a real Hadoop cluster with multiple data nodes. That is, I programmatically start multiple jobs that all read the same input file and map the data differently but use the same reducer, and the output ends up in one single file rather than in 5 different output files.

Please advise.

I was taking a look at merge output files after reduce phase before I decided to ask my question.

Thanks and Kind regards,

Moiz Ahmed.

Multiple MR jobs can use the same Reducer code, but they cannot share the same Reducer instance as described in the OP. Each job and its associated map and reduce tasks are independent of each other. – Praveen Sripati, Oct 24 '12 at 05:02

score 1 · Accepted Answer · answered Oct 25 '12 at 09:36

When different Mappers consume the same input file, in other words the same data structure, the source code for all of them can be placed in separate methods of a single Mapper implementation, with a parameter read from the job configuration (through the context) deciding which map functions to invoke. On the plus side, you need to start only one MapReduce job. The example is pseudocode:

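import java.io.IOException;
import java.util.BitSet;
import org.apache.hadoop.mapreduce.Mapper;

class ComplexMapper extends Mapper {

    // One bit per map variant; filled in setup() from the job configuration.
    protected BitSet mappingBitmap = new BitSet();

    protected void setup(Context context) throws IOException, InterruptedException {
        String params = context.getConfiguration().get("params");
        // --- analyze params and set the corresponding bits in mappingBitmap
    }

    protected void mapA(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // ..... derive keyA (the aggregation key for variant A) from the record
        context.write(keyA, value);
    }

    protected void mapB(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // ..... derive keyB from the record
        context.write(keyB, value);
    }

    protected void mapC(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // ..... derive keyC from the record
        context.write(keyC, value);
    }

    // Runs every enabled variant on the same input record.
    public void map(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}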

Of course, it can be implemented more elegantly with interfaces etc.

In the job setup just add:

Configuration conf = new Configuration();
conf.set("params", "AB");

Job job = new Job(conf);
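
The "analyze params" step in setup() is left open in the pseudocode. One possible reading, sketched below in plain Java, maps each letter of the "params" value to a bit: 'A' to bit 1, 'B' to bit 2, and so on, which lines up with the mappingBitmap.get(1), get(2), get(3) checks in map(). The class name MappingParams and the letter-to-bit scheme are assumptions made for illustration, not part of the original answer.

```java
import java.util.BitSet;

// Hypothetical helper: the name and the letter-to-bit scheme are assumptions.
class MappingParams {

    // Turns a string such as "AB" into a BitSet with bits 1 and 2 set.
    static BitSet parse(String params) {
        BitSet bits = new BitSet();
        if (params == null) {
            return bits; // property not set: no map variant is enabled
        }
        for (char c : params.toCharArray()) {
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Unknown map variant: " + c);
            }
            bits.set(c - 'A' + 1); // 'A' -> 1, 'B' -> 2, 'C' -> 3, ...
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(parse("AB")); // prints {1, 2}
    }
}
```

With this scheme, conf.set("params", "AB") would enable mapA and mapB and leave mapC switched off.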

As Praveen Sripati mentioned, having a single output file forces you to use just one Reducer, which might be bad for performance. You can always concatenate the part* files when you download them from HDFS. Example:

hadoop fs -text /output_dir/part* > wholefile.txt
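
If the part files have already been copied to a local directory, the same concatenation can be done in plain Java. The following is a minimal sketch under that assumption; the class name PartFileMerger is hypothetical, and it works only on the local file system, not on HDFS directly. (The hadoop fs -getmerge command is another way to download and concatenate in one step.)

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for part files that were already downloaded locally.
class PartFileMerger {

    // Appends every file matching part* in localDir to target, in file-name order.
    static void merge(Path localDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(localDir, "part*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // directory listing order is not guaranteed
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the target
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartFileMerger.java <local output dir> <merged file>
        merge(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Note that each part file is sorted by key on its own. With more than one reducer and the default hash partitioning, the concatenated file is not globally sorted.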

score 0 · Answer 2 · answered Oct 24 '12 at 05:04

Usually each reduce task produces a separate file in HDFS so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, configure the job to have a single reduce task. The number of reducers can be configured with the mapred.reduce.tasks property, which defaults to 1. The downside of this approach is that the single reducer may become a bottleneck for the job.
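
To illustrate why the reducer count decides the number of output files, here is a small plain-Java sketch of how keys are spread over reduce tasks. It uses the same arithmetic as Hadoop's default HashPartitioner, but on String keys; a real job would typically use Text keys, whose hash codes differ, so the concrete partition numbers are illustrative only. The class name and the sample roll-up keys are assumptions.

```java
// Hypothetical demo class; not part of Hadoop.
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner.getPartition().
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"2012", "2012-10", "2012-10-24"}; // made-up roll-up keys
        for (int reducers : new int[] {1, 4}) {
            for (String key : keys) {
                // Each partition number corresponds to one reduce task and one part file.
                System.out.println(reducers + " reducer(s): " + key
                        + " -> partition " + partitionFor(key, reducers));
            }
        }
    }
}
```

With one reduce task every key lands in partition 0, hence a single output file. With four, the keys may be spread over up to four part files.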

Another option is to use an output format that allows multiple reducers to write to the same sink simultaneously, such as DBOutputFormat. Once the job has completed, the results can be exported from the database into a flat file. This approach lets multiple reduce tasks run in parallel.

Another option is to merge the output files, as mentioned in the OP. Choose among these approaches based on their pros and cons and on the volume of data to be processed.

answered Oct 24 '12 at 05:04

Praveen Sripati


Thank you all for the answers. Organizing based on a parameter is a good option from a maintenance perspective. I will see which of the options Praveen Sripati described suits best. Can I ask why writing to the same file would be bad for performance? This is HDFS we are talking about; shouldn't HDFS support parallel writes to the same file? – Muhammad Moiz Ahmed Oct 26 '12 at 11:41
Can I ask why writing to the same file would be bad for performance? This is HDFS we are talking about; shouldn't HDFS support parallel writes to the same file? – Muhammad Moiz Ahmed Oct 26 '12 at 12:53
HDFS is not at fault here. One output file -> one Reducer instance -> only one CPU core in the cluster is doing the calculations; all other cores/CPUs sit idle or do other work, but not for this particular task. – alexeipab Oct 26 '12 at 12:54
