
So, I have two jobs, Job A and Job B. For Job A, I would like to have a maximum of 6 mappers per node. However, Job B is a little different. For Job B, I can only run one mapper per node. The reason for this isn't important -- let's just say this requirement is non-negotiable. I would like to tell Hadoop, "For Job A, schedule a maximum of 6 mappers per node. But for Job B, schedule a maximum of 1 mapper per node." Is this possible at all?

The only solution I can think of is:

1) Have two folders off the main hadoop folder, conf.JobA and conf.JobB. Each folder has its own copy of mapred-site.xml. conf.JobA/mapred-site.xml has a value of 6 for mapred.tasktracker.map.tasks.maximum. conf.JobB/mapred-site.xml has a value of 1 for mapred.tasktracker.map.tasks.maximum.

2) Before I run Job A:

2a) Shut down my tasktrackers

2b) Copy conf.JobA/mapred-site.xml into Hadoop's conf folder, replacing the mapred-site.xml that was already in there

2c) Restart my tasktrackers

2d) Wait for the tasktrackers to finish starting

3) Run Job A

and then do a similar thing when I need to run Job B.

I really don't like this solution; it seems kludgey and failure-prone. Is there a better way to do what I need to do?

sangfroid

1 Answer


In the Java code of your custom jar itself you can set the configuration property mapred.tasktracker.map.tasks.maximum separately for each of your jobs.

Do something like this:

Configuration conf = getConf();

// request at most 4 concurrent map tasks per tasktracker for this job
conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);

Job job = new Job(conf);

job.setJarByClass(MyMapRed.class);
job.setJobName(JOB_NAME);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

job.setMapperClass(MapJob.class);
job.setReducerClass(ReduceJob.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

FileInputFormat.setInputPaths(job, args[0]);
// an output path is required as well, e.g.:
FileOutputFormat.setOutputPath(job, new Path(args[1]));

boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
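
Applied to the two jobs in the question, the idea would look roughly like the sketch below. TwoJobsDriver is a hypothetical driver class name and the input/output paths are illustrative; it reuses the MapJob/ReduceJob classes from the snippet above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoJobsDriver {

    // Hypothetical helper: builds a job whose configuration requests
    // maxMapsPerNode concurrent map tasks per tasktracker.
    static Job buildJob(String name, int maxMapsPerNode, Path in, Path out) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", maxMapsPerNode);

        Job job = new Job(conf, name);
        job.setJarByClass(TwoJobsDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MapJob.class);        // same mapper/reducer as in the snippet above
        job.setReducerClass(ReduceJob.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job;
    }

    public static void main(String[] args) throws Exception {
        // Job A: at most 6 mappers per node; Job B: at most 1 mapper per node
        buildJob("Job A", 6, new Path(args[0]), new Path(args[1])).waitForCompletion(true);
        buildJob("Job B", 1, new Path(args[2]), new Path(args[3])).waitForCompletion(true);
    }
}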

EDIT:

You also need to set the property mapred.map.tasks to the value given by the formula (mapred.tasktracker.map.tasks.maximum * number of tasktracker nodes in your cluster).
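
For example, a minimal sketch of that calculation (the node count of 10 here is purely illustrative; substitute your own cluster size):

// hypothetical numbers: 6 map slots per node on a 10-node cluster
int maxMapsPerNode = 6;
int taskTrackerNodes = 10;   // replace with the actual number of tasktracker nodes
conf.setInt("mapred.tasktracker.map.tasks.maximum", maxMapsPerNode);
conf.setInt("mapred.map.tasks", maxMapsPerNode * taskTrackerNodes);   // 6 * 10 = 60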

Amar
  • Thank you for your help, but this did not work for me at all. Have you been able to run a test project with this and verify that it only used 4 mapper slots? It's my understanding that mapred.tasktracker.map.tasks.maximum can only be set on the server side, in mapred-site.xml. I'm using Hadoop 0.20.2; I don't know if that makes a difference. – sangfroid Mar 12 '13 at 21:01
  • I haven't tried this particular config, but we have been setting a lot of other mapred-site configs programmatically for each job. For example, I have successfully limited the number of reducers to 1 by setting `mapred.reduce.tasks` to 1. I have also set `mapred.textoutputformat.separator` and `mapred.output.compress`. Hence, share the code (use pastebin); it's possible that you aren't doing something right. – Amar Mar 13 '13 at 11:55
  • Thanks again for the help. I tried setting mapred.reduce.tasks, but that didn't help, unfortunately. What do you set the other two parameters to? Oh, and here's the pastebin with my test project : http://pastebin.com/2V4UV5TQ – sangfroid Mar 13 '13 at 20:25