Question : How to set the number of DataNodes in Hadoop?
To set or calculate the number of DataNodes, first estimate the required Hadoop storage (H):
H = c * r * S / (1 - i)
where:
c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and the size of the data. When no compression is used, c = 1.
r = replication factor. It is usually 3 in a production cluster.
S = size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data; the incremental data can be estimated on a daily basis, for example, and projected over a period of time (e.g. 3 years).
i = intermediate factor, usually 1/3 or 1/4. It accounts for Hadoop's working space dedicated to storing intermediate results of the Map phase.
Example: with no compression (c = 1), a replication factor of 3, and an intermediate factor of 1/4 (0.25):
H = 1 * 3 * S / (1 - 1/4) = 3S / (3/4) = 4S
With the assumptions above, the Hadoop storage is estimated to be 4 times the size of the initial data size.
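As a quick check on the arithmetic, here is a minimal sketch of the storage estimate in Java; the variable names mirror the formula above, and the raw data size of 100 TB is purely an illustrative assumption, not a value from the question:

// Minimal sketch of the Hadoop storage estimate H = c * r * S / (1 - i).
// All input values are illustrative assumptions.
public class HadoopStorageEstimate {
    public static void main(String[] args) {
        double c = 1.0;   // average compression ratio (1 = no compression)
        double r = 3.0;   // replication factor
        double S = 100.0; // raw data size to move into Hadoop, in TB (assumed)
        double i = 0.25;  // intermediate factor (1/4)

        double H = c * r * S / (1 - i); // estimated total Hadoop storage
        System.out.printf("Estimated Hadoop storage: %.1f TB (%.1fx raw data)%n", H, H / S);
    }
}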
Now the formula to estimate the number of DataNodes (n):
n = H / d = c * r * S / ((1 - i) * d)
where:
d = disk space available per node.
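Continuing the sketch above, the node count follows directly; the 48 TB of usable disk per node is an assumed figure, and the result is rounded up since a fractional node is not possible:

// Sketch of n = H / d with illustrative, assumed values.
public class DataNodeCountEstimate {
    public static void main(String[] args) {
        double H = 400.0; // estimated Hadoop storage in TB (e.g. 4 * 100 TB of raw data)
        double d = 48.0;  // usable disk space per DataNode in TB (assumed)

        int n = (int) Math.ceil(H / d); // round up: 400 / 48 is about 8.33, so 9 DataNodes
        System.out.println("Estimated number of DataNodes: " + n);
    }
}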
Question : "The preferred number of maps around 10-100 maps per-node" so "node" here means NameNode or DataNode?
As you know, MapReduce moves the processing to the nodes where the data resides, not the other way around. So "node" here means DataNode.
Question : How Many Maps?
The number of maps is usually driven by the total size of the inputs,
that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 maps for very cpu-light
map tasks. Task setup takes a while, so it is best if the maps take at
least a minute to execute.
If you have 10TB of input data and a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int)
(which only provides a hint to the framework) is used to set it even higher.
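For illustration only, a small sketch of that arithmetic and of passing the hint through the job configuration; the input size, block size, and the hint value of 100,000 are assumptions for this example, and mapreduce.job.maps remains just a hint to the framework:

// Sketch: rough default map count (one map per input block) and a higher hint.
// Uses the standard Hadoop MapReduce client API; all values are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class MapCountHintExample {
    public static void main(String[] args) throws Exception {
        long inputBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input data
        long blockSizeBytes = 128L * 1024 * 1024;           // 128 MB block size

        long defaultMaps = inputBytes / blockSizeBytes;     // roughly 82,000 maps
        System.out.println("Approximate default number of maps: " + defaultMaps);

        // This setting only provides a hint to the framework and is typically
        // used to raise the map count, not lower it below the number of splits.
        Configuration conf = new Configuration();
        conf.setInt(MRJobConfig.NUM_MAPS, 100_000);          // assumed higher hint
        Job job = Job.getInstance(conf, "map-count-hint-example");
    }
}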