
I currently have two Hadoop jobs, where the second job requires the output of the first to be added to the distributed cache. Currently I run them manually: after the first job finishes, I pass its output file as an argument to the second job, whose driver adds it to the cache.

The first job is just a simple map-only job, and I was hoping I could run one command that performs both jobs in sequence.

Can anyone help me out with the code to get the output of the first job put into the distributed cache so that it can be passed into the second job?

Thanks

Edit: This is the current driver for job 1:

public class PlaceDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: PlaceMapper <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Place Mapper");
        job.setJarByClass(PlaceDriver.class);
        job.setMapperClass(PlaceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This is the driver for job 2. The output of job 1 is passed to job 2 as the first argument and loaded into the cache:

public class LocalityDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: LocalityDriver <cache> <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Job Name Here");
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        job.setNumReduceTasks(1); //TODO: Will change
        job.setJarByClass(LocalityDriver.class);
        job.setMapperClass(LocalityMapper.class);
        job.setCombinerClass(TopReducer.class);
        job.setReducerClass(TopReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mitchell Weiss
  • You can start by writing the code that calls the two jobs here, and then people can help you modify it. – adranale Apr 25 '12 at 07:19

4 Answers


Create two Job objects in the same main method. Have the first one wait for completion before you run the other one.

public class DefaultTest extends Configured implements Tool{


    public int run(String[] args) throws Exception {

        Job job = new Job();

        job.setJobName("DefaultTest-blockx15");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setNumReduceTasks(15);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(DefaultTest.class);

        if (!job.waitForCompletion(true)) {
            return 1;
        }

        Job job2 = new Job();

        // define your second job with the input path set to the output of the previous job

        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        ToolRunner.run(new DefaultTest(), otherArgs);
    }
 }
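To cover the distributed-cache part of the question as well, the two drivers could be merged along these lines. This is only a sketch reusing the class names from the question; note that job 1's output is a directory, and the `part-m-00000` file name is an assumption that depends on the output format and number of map tasks of a map-only job.

```java
public class ChainedDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 4) {
            System.err.println("Usage: ChainedDriver <job1 in> <job1 out> <job2 in> <job2 out>");
            System.exit(2);
        }

        // Job 1: the map-only PlaceMapper job from the question
        Job job1 = new Job(conf, "Place Mapper");
        job1.setJarByClass(PlaceDriver.class);
        job1.setMapperClass(PlaceMapper.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job1, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job1, new Path(otherArgs[1]));
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: add job 1's output to the distributed cache before submitting.
        // "part-m-00000" is an assumed part-file name for a single-mapper,
        // map-only job; adjust it (or iterate the output directory) as needed.
        Job job2 = new Job(new Configuration(), "Locality");
        DistributedCache.addCacheFile(
                new Path(otherArgs[1], "part-m-00000").toUri(),
                job2.getConfiguration());
        job2.setJarByClass(LocalityDriver.class);
        job2.setMapperClass(LocalityMapper.class);
        job2.setCombinerClass(TopReducer.class);
        job2.setReducerClass(TopReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job2, new Path(otherArgs[2]));
        TextOutputFormat.setOutputPath(job2, new Path(otherArgs[3]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```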
inquire

A straightforward answer would be to extract the code of both main methods into two separate methods, for example boolean job1() and boolean job2(), and call them one after the other in a single main method, like this:

public static void main(String[] args) throws Exception {
   if (job1()) {
      job2();
   }
}

where the return values of the job1() and job2() calls are the result of calling job.waitForCompletion(true).
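As a sketch of what such a method might look like, assuming the job configuration from the question's PlaceDriver (job2() would follow the same pattern, adding the cache file first):

```java
// Hypothetical job1() helper built from the question's PlaceDriver;
// the in/out parameters replace the command-line arguments.
private static boolean job1(String in, String out) throws Exception {
    Job job = new Job(new Configuration(), "Place Mapper");
    job.setJarByClass(PlaceDriver.class);
    job.setMapperClass(PlaceMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    TextInputFormat.addInputPath(job, new Path(in));
    TextOutputFormat.setOutputPath(job, new Path(out));
    // the boolean result tells main() whether to proceed to job2()
    return job.waitForCompletion(true);
}
```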

adranale

Job chaining in MapReduce is a pretty common scenario. You can try Cascading, an open-source MapReduce workflow management library. There is also some discussion about Cascading going on here, and you can check discussions similar to yours here.

Shumin Guo

You can also use ChainMapper, JobControl, and ControlledJob to control your job flow:

Configuration config = getConf();

Job j1 = new Job(config);
Job j2 = new Job(config);
Job j3 = new Job(config);

// run the first job on its own
j1.waitForCompletion(true);

// then let JobControl run j2 and j3, with j3 depending on j2
JobControl jobFlow = new JobControl("jobFlow");
ControlledJob cj2 = new ControlledJob(j2, null);
jobFlow.addJob(cj2);
jobFlow.addJob(new ControlledJob(j3, Lists.newArrayList(cj2)));
jobFlow.run();
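One caveat: JobControl.run() loops until every job has finished, so as an alternative to calling it directly on the main thread, it is commonly driven from a background thread while the driver polls allFinished(). A sketch of that pattern (the 500 ms poll interval is an arbitrary choice here):

```java
// JobControl implements Runnable, so it can run on its own thread.
Thread flowThread = new Thread(jobFlow);
flowThread.setDaemon(true);
flowThread.start();

while (!jobFlow.allFinished()) {
    Thread.sleep(500); // wait for every ControlledJob to succeed or fail
}
jobFlow.stop(); // let the JobControl thread exit
```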
surajz