
When I write a MapReduce program, I often write code like

 job1.setMapOutputKeyClass(Text.class); 

But why do we need to specify the map output key class explicitly? We have already specified it in the mapper class, such as

public static class MyMapper extends
        Mapper<LongWritable, Text, Text, Text>

In the book Hadoop: The Definitive Guide, there is a table (Properties for configuring types) showing that the setMapOutputKeyClass method is optional, but when I tested this I found it is necessary; otherwise the Eclipse console shows

Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, received org.apache.hadoop.io.Text

Can someone tell me the reason for this?

In the book, it says:

"The settings that have to be compatible with the MapReduce types are listed in the lower part of Table 8-1." Does this mean we have to set the lower-part properties, but do not have to set the upper-part ones?

The content of the table looks like this:

Properties for configuring types:
mapreduce.job.inputformat.class  
mapreduce.map.output.key.class  
mapreduce.map.output.value.class  
mapreduce.job.output.key.class  
mapreduce.job.output.value.class 

Properties that must be consistent with the types:
mapreduce.job.map.class   
mapreduce.job.combine.class  
mapreduce.job.partitioner.class  
mapreduce.job.output.key.comparator.class 
mapreduce.job.output.group.comparator.class  
mapreduce.job.reduce.class  
mapreduce.job.outputformat.class
– Coinnigh

1 Answer


setMapOutputKeyClass() and setMapOutputValueClass() are optional as long as the map output types match your job's output types as specified by setOutputKeyClass() and setOutputValueClass() respectively. In other words, if your mapper's output types do not match your reducer's output types, you have to use one or both of these methods.
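To make that concrete, here is a minimal driver sketch (MyMapper and MyReducer are placeholder names, not classes from the original question) where the mapper emits <Text, Text> but the reducer emits <Text, IntWritable>, so the map output value type can no longer be inferred from the job output types:

```java
Job job = Job.getInstance(new Configuration(), "example");
job.setMapperClass(MyMapper.class);      // Mapper<LongWritable, Text, Text, Text>
job.setReducerClass(MyReducer.class);    // Reducer<Text, Text, Text, IntWritable>

// Job (reduce) output types:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Map output types: the key class could be omitted here because it
// matches the job output key class (Text), but the value class is
// required because Text differs from IntWritable.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
```

If the mapper and reducer emitted identical types, both setMapOutput* calls could be dropped, since the map output types default to the job output types.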

As for your question regarding generic arguments, due to Java type erasure (Java generics type erasure: when and what happens?), Hadoop does not know them at runtime, even though they are known to the compiler.
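The erasure point can be demonstrated in plain Java, independent of Hadoop: two lists with different type arguments share the same runtime class, because the type arguments exist only at compile time.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();
        // The type arguments <String> and <Integer> are erased at
        // runtime, so both lists report the same Class object.
        System.out.println(strings.getClass() == ints.getClass()); // prints "true"
    }
}
```

This is why Hadoop asks you to pass Text.class and friends explicitly: the Class objects survive at runtime, while the generic parameters on Mapper<LongWritable, Text, Text, Text> do not (in the general case).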

– yurgis
  • No, it does not mean that the properties mentioned at the top are optional; e.g., you still need to specify the job output types. What the book means is that you can set any types for the properties at the bottom and they will compile fine, but if they are not consistent with the properties at the top, Hadoop will fail at runtime with errors such as type mismatches or ClassCastExceptions. – yurgis Jul 14 '16 at 15:41
  • So it means the top properties need to be specified explicitly, except those that have a default value, such as the input format, which defaults to TextInputFormat, correct? – Coinnigh Jul 14 '16 at 16:01
  • 1
    Some of them required and some of them default to the required ones, such as mapreduce.map.output.key.class mapreduce.map.output.value.class default to mapreduce.job.output.key.class mapreduce.job.output.value.class. – yurgis Jul 14 '16 at 16:02