
I currently have two Hadoop jobs, where the second job requires the output of the first to be added to the distributed cache. Currently I run them manually: after the first job finishes, I pass its output file as an argument to the second job, whose driver adds it to the cache.

The first job is just a simple map-only job, and I was hoping I could run one command that performs both jobs in sequence.

Can anyone help me out with the code to get the output of the first job put into the distributed cache so that it can be passed into the second job?

Thanks

Edit: This is the current driver for job 1:

public class PlaceDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: PlaceMapper <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Place Mapper");
        job.setJarByClass(PlaceDriver.class);
        job.setMapperClass(PlaceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This is the driver for job 2. The output of job 1 is passed to job 2 as the first argument and loaded into the cache:

public class LocalityDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: LocalityDriver <cache> <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Job Name Here");
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        job.setNumReduceTasks(1); //TODO: Will change
        job.setJarByClass(LocalityDriver.class);
        job.setMapperClass(LocalityMapper.class);
        job.setCombinerClass(TopReducer.class);
        job.setReducerClass(TopReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mitchell Weiss
  • You can start by writing the code that calls the two jobs here, and then people can help you modify it. – adranale Apr 25 '12 at 07:19

4 Answers


Create two Job objects in the same main method. Have the first one wait for completion before you run the other one.

public class DefaultTest extends Configured implements Tool{


    public int run(String[] args) throws Exception {

        Job job = new Job();

        job.setJobName("DefaultTest-blockx15");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setNumReduceTasks(15);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(DefaultTest.class);

        if (!job.waitForCompletion(true)) {
            return 1;
        }

        Job job2 = new Job();

        // define your second job with the input path set to the output of the previous job

        return 0;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        ToolRunner.run(new DefaultTest(), otherArgs);
    }
 }
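To cover the distributed-cache part of the question as well, the two drivers could be merged along these lines. This is only a sketch reusing the class names from the question; note that job 1's output is a directory, and the `part-m-00000` file name is an assumption that depends on the output format and number of map tasks of a map-only job.

```java
public class ChainedDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 4) {
            System.err.println("Usage: ChainedDriver <job1 in> <job1 out> <job2 in> <job2 out>");
            System.exit(2);
        }

        // Job 1: the map-only PlaceMapper job from the question
        Job job1 = new Job(conf, "Place Mapper");
        job1.setJarByClass(PlaceDriver.class);
        job1.setMapperClass(PlaceMapper.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job1, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job1, new Path(otherArgs[1]));
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: add job 1's output to the distributed cache before submitting.
        // "part-m-00000" is an assumed part-file name for a single-mapper,
        // map-only job; adjust it (or iterate the output directory) as needed.
        Job job2 = new Job(new Configuration(), "Locality");
        DistributedCache.addCacheFile(
                new Path(otherArgs[1], "part-m-00000").toUri(),
                job2.getConfiguration());
        job2.setJarByClass(LocalityDriver.class);
        job2.setMapperClass(LocalityMapper.class);
        job2.setCombinerClass(TopReducer.class);
        job2.setReducerClass(TopReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job2, new Path(otherArgs[2]));
        TextOutputFormat.setOutputPath(job2, new Path(otherArgs[3]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```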
inquire

A straightforward answer would be to extract the code of both main methods into two separate methods, for example boolean job1() and boolean job2(), and call them one after the other in a single main method, like this:

public static void main(String[] args) throws Exception {
   if (job1()) {
      job2();
   }
}

where the return values of the job1() and job2() calls are the result of calling job.waitForCompletion(true).
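As a sketch of what such a method might look like, assuming the job configuration from the question's PlaceDriver (job2() would follow the same pattern, adding the cache file first):

```java
// Hypothetical job1() helper built from the question's PlaceDriver;
// the in/out parameters replace the command-line arguments.
private static boolean job1(String in, String out) throws Exception {
    Job job = new Job(new Configuration(), "Place Mapper");
    job.setJarByClass(PlaceDriver.class);
    job.setMapperClass(PlaceMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    TextInputFormat.addInputPath(job, new Path(in));
    TextOutputFormat.setOutputPath(job, new Path(out));
    // the boolean result tells main() whether to proceed to job2()
    return job.waitForCompletion(true);
}
```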

adranale

Job chaining in MapReduce is a pretty common scenario. You can try Cascading, an open-source MapReduce workflow management library. There is also some discussion about Cascading going on here, and you can check discussions similar to yours here.

Shumin Guo

You can also use ChainMapper, JobControl, and ControlledJob to control your job flow:

Configuration config = getConf();

Job j1 = new Job(config);
Job j2 = new Job(config);
Job j3 = new Job(config);

// run the first job on its own
j1.waitForCompletion(true);

// then let JobControl run j2 and j3, with j3 depending on j2
JobControl jobFlow = new JobControl("jobFlow");
ControlledJob cj2 = new ControlledJob(j2, null);
jobFlow.addJob(cj2);
jobFlow.addJob(new ControlledJob(j3, Lists.newArrayList(cj2)));
jobFlow.run();
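One caveat: JobControl.run() loops until every job has finished, so as an alternative to calling it directly on the main thread, it is commonly driven from a background thread while the driver polls allFinished(). A sketch of that pattern (the 500 ms poll interval is an arbitrary choice here):

```java
// JobControl implements Runnable, so it can run on its own thread.
Thread flowThread = new Thread(jobFlow);
flowThread.setDaemon(true);
flowThread.start();

while (!jobFlow.allFinished()) {
    Thread.sleep(500); // wait for every ControlledJob to succeed or fail
}
jobFlow.stop(); // let the JobControl thread exit
```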
surajz