
In Hadoop, if we have not set the number of reducers, how many reducers will be created?

For example, the number of mappers depends on (total data size) / (input split size): if the data size is 1 TB and the input split size is 100 MB, then the number of mappers will be (1000 * 1000) / 100 = 10,000 (ten thousand).

Which factors does the number of reducers depend on? How many reducers are created for a job?

  • The number of reducers is 1 by default, unless you set it to any custom number that makes sense for your application, using `job.setNumReduceTasks(n);`. I would suggest skipping the "rules of thumb" that exist. – vefthym Jan 11 '16 at 13:04
  • @vefthym is it still true? I forgot to specify any number and had 56 GB of data, and it got split into 7 files of 8 GB each. Is there an automatic fallback if there is too much data for one reducer? – Thomas May 03 '17 at 06:52
  • @Thomas are you referring to the number of output files, or the number of input splits? The default of 1 is for the output files (reduce tasks). I believe it is still true. If you did not set it programmatically, you could have also set it as a runtime parameter. – vefthym May 03 '17 at 07:02
  • @vefthym to put it another way: I didn't set any number programmatically, nor in the params. And I don't understand why I have 7 files of 8 GB each. Why not 3 files of 19 GB, or 14 of 4 GB, or ...? – Thomas May 03 '17 at 10:38
  • @Thomas having no information about the code that you ran, I cannot say much more. Please, add a new question and post the link here, if you wish. Having the same output size as the input size is (usually) not the case. Please clarify what you are trying to do in your question. – vefthym May 03 '17 at 10:42

2 Answers


How Many Reduces? (from the official documentation)

The right number of reduces seems to be 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).

With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.

Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
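As a concrete (hypothetical) illustration of those factors: on a cluster with 10 nodes and 8 containers per node, 0.95 * 10 * 8 = 76 reduce tasks would let all reduces run in a single wave, while 1.75 * 10 * 8 = 140 reduce tasks would run in roughly two waves, with the faster nodes picking up extra reduces.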

The same documentation covers the mapper count too.

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
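To see where the 82,000 figure comes from: treating 10 TB as 10 × 1024 × 1024 MB = 10,485,760 MB and dividing by the 128 MB block size gives 81,920 blocks, i.e. roughly 82,000 map tasks, one per block.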

If you want to change the default value of 1 for the number of reducers, you can set the property below (from Hadoop 2.x onwards) as a command-line parameter:

mapreduce.job.reduces

OR

you can set it programmatically with:

job.setNumReduceTasks(integer_number);
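
For completeness, here is a minimal driver sketch that supports both routes. The class name, job name, and paths are placeholders, and it assumes the usual Tool/ToolRunner pattern so that -D options from the command line are picked up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReducerCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Any -D mapreduce.job.reduces=<n> passed on the command line is
        // already present in getConf(), because ToolRunner parses the
        // generic options before calling run().
        Job job = Job.getInstance(getConf(), "reducer-count-example");
        job.setJarByClass(ReducerCountDriver.class);

        // Programmatic alternative; uncommenting this overrides the
        // command-line value.
        // job.setNumReduceTasks(10);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ReducerCountDriver(), args));
    }
}

With such a driver, something like hadoop jar my-job.jar ReducerCountDriver -D mapreduce.job.reduces=10 /in /out (the jar name and paths are hypothetical) would run the job with 10 reducers.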
Ravindra babu

By default, the number of reducers is set to 1.

You can change it by setting the parameter

mapred.reduce.tasks

on the command line, in the driver code, or in the configuration file that you pass.

E.g., as a command-line argument: bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>, or in the driver code: conf.setNumReduceTasks(int num);
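
A minimal sketch of the driver-code route with the older mapred API that this answer uses (the class name, job name, reducer count, and paths are made up for illustration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OldApiDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldApiDriver.class);
        conf.setJobName("reducer-count-old-api");

        // Equivalent of passing -Dmapred.reduce.tasks=<num reduce tasks>
        // on the command line.
        conf.setNumReduceTasks(4);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}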

Recommended read: https://wiki.apache.org/hadoop/HowManyMapsAndReduces

Koustav Ray