
What if the reducer's output types are different from the mapper's output types? Every piece of documentation says it will throw an error. But my question is WHY?

The reducer's output is the endpoint, so why does it matter that, if it differs from the mapper's output, the mapper's output types must be set using the setMapOutputKeyClass and setMapOutputValueClass methods? WHY IS THIS REQUIRED? WHAT WILL HAPPEN, apart from an error, if the types don't match and aren't set with these methods?

What does the framework take care of when we set the output types using these two methods?

Can anyone explain the concept behind this? I've been looking for an explanation for a long time.

Edit:

We set generic type arguments on our Mapper and Reducer classes to avoid runtime exceptions. But at runtime those generics are erased to Object (Java type erasure).

So that means when the mapper's output is written with context.write(new Text(year), new IntWritable(airTemp)), at runtime it is stored as raw bytes in Object types. And when the reducer class is called, it takes those raw bytes from the mapper and produces output in raw bytes only. By default the mapper and reducer output types are the same, but when they differ we set them using the methods above. Using those methods, the Hadoop framework converts the raw bytes into the specific types and writes them into the output file.
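A rough sketch of the kind of mapper I mean (the class name and record layout here are just placeholders, not my real code):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical weather-style mapper: emits (year, air temperature).
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assumed record layout: first 4 characters are the year, the rest the temperature.
        String year = line.substring(0, 4);
        int airTemp = Integer.parseInt(line.substring(4).trim());
        // These objects are serialized to raw bytes for the shuffle; the framework can only
        // deserialize them on the reduce side because the classes are declared on the Job
        // (via setMapOutputKeyClass / setMapOutputValueClass).
        context.write(new Text(year), new IntWritable(airTemp));
    }
}
```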

Does that make sense?

Joy

1 Answer


The reducer's output types can be different from the mapper's.

What matters is that the reducer input types match the mapper output types.

For example, WordCount's mapper input type is (Long, Text) and its output type is (Text, Int), making the reducer input also (Text, Int); but the final reducer output could easily be (Text, Double) or (Null, Float).
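As a sketch (class names here are hypothetical; only the generic signatures matter):

```java
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// input (LongWritable, Text), output (Text, IntWritable).
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() omitted; it would tokenize each line and emit (word, 1) pairs
}

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// the input types must match the mapper's output types (Text, IntWritable),
// but the output types can be something else entirely, e.g. (Text, DoubleWritable).
class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    // reduce() omitted; it could emit an average count as a DoubleWritable
}
```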

These methods set the final (reducer) outputs for the Job:

  • setOutputKeyClass()
  • setOutputValueClass()
  • setOutputFormatClass() -- Needs to be a type that exposes the output key and value

The methods you mentioned are specifically for the Map Tasks.

The defaults are TextInputFormat and TextOutputFormat, which map to <LongWritable, Text> for mapper input and your configured <K, V> reducer output.
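Putting the setters together, here is a rough driver sketch (it reuses the hypothetical TokenMapper and AverageReducer classes from above; the job name and paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example");
        job.setJarByClass(Driver.class);

        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(AverageReducer.class);

        // Map (intermediate) output types: required because type erasure removes the
        // generics at runtime, so the framework cannot infer how to (de)serialize the
        // intermediate (K, V) pairs on its own.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (reducer) output types for the Job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // These two are also the defaults, shown here for completeness.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```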

OneCricketeer
  • cricket_007, agreed! But when the reducer's output is different, why does it need to be set or mentioned using the above two methods? – Joy Jan 21 '18 at 23:30
  • Sorry, don't understand. It needs to be set regardless of being different, or the same. How would the reduce task work if not set? – OneCricketeer Jan 22 '18 at 01:30
  • Those types are set when we write mapper and reducer class...No ? – Joy Jan 22 '18 at 01:35
  • 1
    No. The generics you pass to the Mapper and Reducer cannot be auto determined. This is due to Java type erasure. – OneCricketeer Jan 22 '18 at 01:36
  • Ah well Okay..! Can you please explain more ..? That's true mapper and reducer classes are generic types. – Joy Jan 22 '18 at 01:38
  • 1) https://stackoverflow.com/questions/14225205/where-does-job-setoutputkeyclass-and-job-setoutputreduceclass-refers-to 2) https://stackoverflow.com/a/38377972/2308683 3) https://developer.yahoo.com/hadoop/tutorial/module4.html – OneCricketeer Jan 22 '18 at 01:42
  • Thanks for sharing, but I have seen those links. I understand Java erasure is being used to avoid runtime errors, but that belongs to the reducer and mapper classes. What benefit is gained by setting setMapOutputKeyClass(Text.class) and setMapOutputValueClass(IntWritable.class)? At runtime every generic type argument is turned into Object... then why is it required? I'm confused here! – Joy Jan 22 '18 at 16:28
  • Benefit is that the overall MapReduce process knows how to convert the raw bytes of data into Java objects – OneCricketeer Jan 22 '18 at 17:28
  • Can you please check my edit comment in question ? That's what i have understood so far. Please correct me If am wrong..! – Joy Jan 23 '18 at 03:13
  • 1
    "produce output in raw bytes only" is not correct. You have to set the output types of the job. – OneCricketeer Jan 23 '18 at 03:16