Question : How to set the number of DataNodes in Hadoop?
To set or calculate the number of DataNodes, first estimate the required Hadoop storage (H):
H = c * r * S / (1 - i)
where:
c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and the size of the data. When no compression is used, c = 1.
r = replication factor. It is usually 3 in a production cluster.
S = size of the data to be moved to Hadoop. This could be a combination of historical data and incremental data; the incremental data can be estimated on a daily basis, for example, and projected over a period of time (e.g. 3 years).
i = intermediate factor, usually 1/3 or 1/4. It accounts for Hadoop's working space dedicated to storing intermediate results of the Map phase.
Example: with no compression (c = 1), a replication factor of 3, and an intermediate factor of 1/4 (0.25):
H = 1 * 3 * S / (1 - 1/4) = 3S / (3/4) = 4S
With the assumptions above, the Hadoop storage is estimated to be 4 times the size of the initial data size.
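As a quick check on the arithmetic, here is a minimal sketch of the storage estimate in Java; the variable names mirror the formula above, and the raw data size of 100 TB is purely an illustrative assumption, not a value from the question:

// Minimal sketch of the Hadoop storage estimate H = c * r * S / (1 - i).
// All input values are illustrative assumptions.
public class HadoopStorageEstimate {
    public static void main(String[] args) {
        double c = 1.0;   // average compression ratio (1 = no compression)
        double r = 3.0;   // replication factor
        double S = 100.0; // raw data size to move into Hadoop, in TB (assumed)
        double i = 0.25;  // intermediate factor (1/4)

        double H = c * r * S / (1 - i); // estimated total Hadoop storage
        System.out.printf("Estimated Hadoop storage: %.1f TB (%.1fx raw data)%n", H, H / S);
    }
}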
Now the formula to estimate the number of DataNodes (n):
n = H / d = c * r * S / ((1 - i) * d)
where:
d = disk space available per node.
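Continuing the sketch above, the node count follows directly; the 48 TB of usable disk per node is an assumed figure, and the result is rounded up since a fractional node is not possible:

// Sketch of n = H / d with illustrative, assumed values.
public class DataNodeCountEstimate {
    public static void main(String[] args) {
        double H = 400.0; // estimated Hadoop storage in TB (e.g. 4 * 100 TB of raw data)
        double d = 48.0;  // usable disk space per DataNode in TB (assumed)

        int n = (int) Math.ceil(H / d); // round up: 400 / 48 is about 8.33, so 9 DataNodes
        System.out.println("Estimated number of DataNodes: " + n);
    }
}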
Question : "The preferred number of maps around 10-100 maps per-node" so "node" here means NameNode or DataNode?
As you know, MapReduce moves the processing to the nodes where the data resides, not the other way around. So "node" here means DataNode.
Question : How Many Maps?
The number of maps is usually driven by the total size of the inputs,
that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 maps for very cpu-light
map tasks. Task setup takes a while, so it is best if the maps take at
least a minute to execute.
If you have 10TB of input data and a blocksize of 128MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int)
(which only provides a hint to the framework) is used to set it even higher.
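For illustration only, a small sketch of that arithmetic and of passing the hint through the job configuration; the input size, block size, and the hint value of 100,000 are assumptions for this example, and mapreduce.job.maps remains just a hint to the framework:

// Sketch: rough default map count (one map per input block) and a higher hint.
// Uses the standard Hadoop MapReduce client API; all values are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class MapCountHintExample {
    public static void main(String[] args) throws Exception {
        long inputBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input data
        long blockSizeBytes = 128L * 1024 * 1024;           // 128 MB block size

        long defaultMaps = inputBytes / blockSizeBytes;     // roughly 82,000 maps
        System.out.println("Approximate default number of maps: " + defaultMaps);

        // This setting only provides a hint to the framework and is typically
        // used to raise the map count, not lower it below the number of splits.
        Configuration conf = new Configuration();
        conf.setInt(MRJobConfig.NUM_MAPS, 100_000);          // assumed higher hint
        Job job = Job.getInstance(conf, "map-count-hint-example");
    }
}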