
I'm running a Hadoop job with mapred.reduce.tasks = 100 (just experimenting). The number of maps spawned is 537, as that depends on the input splits. The problem is that the number of reducers running in parallel won't go beyond 4, even after the maps are 100% complete. Is there a way to increase the number of reducers running in parallel? The CPU usage is suboptimal and the reduce phase is very slow.

I have also set mapred.tasktracker.reduce.tasks.maximum = 100, but this doesn't seem to affect the number of reducers running in parallel.
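
For reference, here is a minimal sketch of the relevant mapred-site.xml entries I'm experimenting with (the values are just for this experiment, and the tasktracker property has to be set in the mapred-site.xml of every tasktracker node):

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>100</value> <!-- maximum reduce slots per tasktracker -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>100</value> <!-- default number of reduce tasks per job -->
</property>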

cprsd

3 Answers


Check the hashcodes used by the partitioner; if your keys map to only 4 distinct partition values, only 4 reducers will receive any data.

You might need to implement your own partitioner to get more reducers; however, if your mappers produce only 4 keys, 4 is the maximum number of reducers that will do useful work.
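
For illustration, a minimal custom Partitioner sketch (this assumes Text keys and values, which is not stated in the answer, and the class name is hypothetical):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: spreads keys over all configured reducers by
// hashing the whole key. Only as many reducers as there are distinct
// partition values will actually receive data.
public class KeySpreadPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Register it on the job with job.setPartitionerClass(KeySpreadPartitioner.class); note that this is essentially what the default HashPartitioner already does.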

rsp
  • The mappers produce about 200,000 keys. A typical mapper output would be the key-value pair <"www.xyz.com", "http://www.xyz.com/page1">, and all URLs from the same host are to go to one reducer. So if I have a number of reducers running in parallel, I can process data from multiple hosts at a time. – cprsd Nov 06 '12 at 17:12

You can specify the number of reducers in the job configuration, like below:

job.setNumReduceTasks(6);

Also, when you are executing your jar, you can pass the property like below:

-D mapred.reduce.tasks=6
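
Note that -D properties on the command line are only picked up when the job goes through ToolRunner / GenericOptionsParser. A minimal driver sketch (the class name and job details are hypothetical, not taken from the question):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: getConf() already contains any -D overrides
// that ToolRunner parsed from the command line.
public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "my job");
        job.setJarByClass(MyJobDriver.class);
        // set mapper/reducer classes, input/output paths, etc. here
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}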

  • I've set mapred.reduce.tasks in mapred-site.xml, but that's not what I want. I want to increase the "capacity" of reducers. – cprsd Nov 06 '12 at 12:01
  • The mapred.reduce.tasks = 100 property is not the issue here. It depends on the CPU and I/O bandwidth available. Do you know how many CPUs you have? If not, try cat /proc/cpuinfo –  Nov 06 '12 at 14:13

It turns out all that was required was a restart of the mapred and dfs daemons after changing mapred-site.xml. mapred.tasktracker.reduce.tasks.maximum is indeed the right parameter to set to increase the reduce capacity.

I can't understand why Hadoop doesn't reload mapred-site.xml every time a job is submitted.
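
For reference, on Hadoop 1.x the restart amounts to something like the following, run on the JobTracker/NameNode host (paths assume the standard bin/ layout of the distribution):

bin/stop-mapred.sh && bin/start-mapred.sh   # restart the JobTracker and TaskTrackers
bin/stop-dfs.sh && bin/start-dfs.sh         # restart HDFS (see the comment below: not strictly needed)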

cprsd
  • Just FYI, you only need to restart mapred after editing mapred-site.xml; restarting dfs is not necessary. – sufinawaz Oct 03 '13 at 15:34