
I have observed that there are multiple ways to write the driver method of a Hadoop program.

The following method is given in the Hadoop Tutorial by Yahoo:

 public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }

and this method is given in the book Hadoop: The Definitive Guide (O'Reilly, 2012):

public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    System.err.println("Usage: MaxTemperature <input path> <output path>");
    System.exit(-1);
  }
  Job job = new Job();
  job.setJarByClass(MaxTemperature.class);
  job.setJobName("Max temperature");
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.setMapperClass(MaxTemperatureMapper.class);
  job.setReducerClass(MaxTemperatureReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

While trying the program given in the O'Reilly book, I found that the constructors of the Job class are deprecated. As the O'Reilly book is based on Hadoop 2 (YARN), I was surprised to see that they used a deprecated class.
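
As far as I can tell, the non-deprecated way on Hadoop 2.x is the static Job.getInstance() factory instead of the constructors; a minimal sketch of what I mean (the class name is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobFactoryExample {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance() avoids the deprecated new Job(...) constructors
        return Job.getInstance(conf, "Max temperature");
    }
}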

I would like to know which method everyone uses.

Shekhar

2 Answers


I use the former approach. If we go with overriding the run() method, we can use hadoop jar options like -D, -libjars, -files, etc. All of these are very much needed in almost any Hadoop project. I am not sure whether we can use them through the main() method.
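
To illustrate (just a sketch; the class and property names are made up): when the driver goes through ToolRunner, anything passed with -D on the command line is already in the Configuration returned by getConf() by the time run() is called.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ShowGenericOptions extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already run GenericOptionsParser before calling run(),
        // so a value passed as
        //   hadoop jar example.jar ShowGenericOptions -D my.custom.flag=true
        // is available here:
        System.out.println("my.custom.flag = " + getConf().get("my.custom.flag"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ShowGenericOptions(), args));
    }
}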

Bill the Lizard

Slightly different from your first (Yahoo) block: you should be using the ToolRunner / Tool classes, which take advantage of the GenericOptionsParser (as noted in Eswara's answer).

A template pattern would be something like:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolExample extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // old API
        JobConf jobConf = new JobConf(getConf());

        // new API
        Job job = new Job(getConf());

        // rest of your config here

        // determine success / failure (depending on your choice of old / new api)
        // return 0 for success, non-zero for an error
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ToolExample(), args));
    }
}
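
For reference, this is roughly how the template might look once filled in with the new API and the classes from your second example (a sketch only; it assumes the MaxTemperatureMapper / MaxTemperatureReducer classes from the question are on the classpath):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Job.getInstance() is the non-deprecated factory on Hadoop 2.x
        Job job = Job.getInstance(getConf());
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // map success / failure to the conventional exit codes
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
    }
}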
Chris White