Thanks everyone for your input. I have done some research on the topic and am sharing my findings below.
- Any static number is a magic number. I propose the block-count threshold be: heap memory (in GB) x 1 million x comfort percentage (say 50%).
Why?
Rule of thumb: 1 GB of heap per 1 million blocks (Cloudera [1]).
The actual amount of heap memory required by the NameNode turns out to be much lower:
Heap needed = (number of blocks + number of inodes (files + folders)) x object size (150-300 bytes [1])
For 1 million small files: heap needed = (1M + 1M) x 300 bytes ≈ 572 MB <== much smaller than the rule of thumb.
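To make the two calculations above concrete, here is a minimal Python sketch (the 300-byte object size and the 1 GB per 1M blocks rule come from [1]; the function names and the 16 GB example heap are mine):

    # Proposed block-count threshold: heap (GB) x 1 million x comfort percentage
    def block_threshold(heap_gb, comfort=0.5):
        return int(heap_gb * 1_000_000 * comfort)

    # Heap actually needed: (blocks + inodes) x bytes per object (150-300 B per [1])
    def heap_needed_bytes(num_blocks, num_inodes, bytes_per_object=300):
        return (num_blocks + num_inodes) * bytes_per_object

    print(block_threshold(16))                              # 8,000,000 blocks for a ~16 GB heap
    print(heap_needed_bytes(1_000_000, 1_000_000) / 2**20)  # ~572 MB for 1M small files

So for a heap like the ~16 GB one in the UI example below, the proposed threshold at 50% comfort would be about 8 million blocks.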
- A high block count may indicate both.
The NameNode UI shows the heap currently used. For example:
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.
** Note: the heap memory actually used (5.82 GB) is still higher than the estimate of 16,674,707 objects x 300 bytes ≈ 4.65 GB.
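If you prefer to pull these numbers programmatically rather than from the web UI, the NameNode exposes the same information over its JMX servlet. A minimal Python sketch, assuming the standard /jmx endpoint and the usual FSNamesystem and JvmMetrics bean/metric names (verify these against your Hadoop version; the host and port are the same as the UI above):

    import json, urllib.request

    NAMENODE = "http://namenode:50070"  # same host/port as the UI above

    def jmx_bean(query):
        # Query a single JMX bean from the NameNode's /jmx servlet
        with urllib.request.urlopen(NAMENODE + "/jmx?qry=" + query) as resp:
            return json.load(resp)["beans"][0]

    fs = jmx_bean("Hadoop:service=NameNode,name=FSNamesystem")
    jvm = jmx_bean("Hadoop:service=NameNode,name=JvmMetrics")

    objects = fs["FilesTotal"] + fs["BlocksTotal"]
    print("filesystem objects:", objects)
    print("estimated heap    : %.2f GB (at 300 B/object)" % (objects * 300 / 2**30))
    print("actual heap used  : %.2f GB of %.2f GB" % (jvm["MemHeapUsedM"] / 1024, jvm["MemHeapMaxM"] / 1024))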
To check whether a cluster holds mostly small files, look at the average block size reported by fsck:
hdfs fsck / -blocks | grep "Total blocks (validated):"
It returns something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== the average block size here is well under 1 MB
- Yes, a file is small if its size < dfs.blocksize (128 MB by default).
- Each file occupies its own block(s) on disk, but a block only consumes as much disk space as the file's actual contents, so a small file results in a small block.
- For every new file, an inode object (~150 bytes) plus a block object is created in the NameNode's memory, which puts stress on the NameNode heap (a worked comparison follows below).
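To illustrate the per-file cost described in the last two points, compare roughly the same 1 GB of data stored as many small files versus a few block-sized files (illustrative numbers only, reusing the 300 B/object figure from [1]):

    # ~1 GB of data, two layouts:
    #   1,000,000 x 1 KB files -> 1M inodes + 1M blocks = 2,000,000 objects
    #   8 x 128 MB files       -> 8 inodes + 8 blocks   = 16 objects
    for label, n_files, n_blocks in [("1M x 1 KB", 1_000_000, 1_000_000), ("8 x 128 MB", 8, 8)]:
        print(label, (n_files + n_blocks) * 300, "bytes of NameNode heap")

That is roughly 572 MB of NameNode heap for the small-file layout versus about 5 KB for the large-file layout, for the same amount of data.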
Impact on the NameNode and DataNodes:
Small files pose problems for both the NameNode and the DataNodes:
NameNode:
- The ceiling on the number of files the cluster can hold is lowered, because the NameNode keeps metadata for every file in memory.
- Restarts take longer, because the NameNode must read the metadata of every file from the fsimage on local disk.
DataNodes:
- A large number of small files means a large amount of random disk I/O; HDFS is designed for large files and benefits from sequential reads.
[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html