The default block size of HDFS is 64MB. So if we have, for example, 200MB of data, then according to the HDFS block size it will be divided into 4 blocks of 64MB, 64MB, 64MB and 8MB. My question is: why is the data not divided into 4 equal blocks of 50MB, so that each of them fits within the 64MB block size?
You can refer to this link http://stackoverflow.com/questions/19473772/data-block-size-in-hdfs-why-64mb – Kiran Krishna Innamuri Jul 26 '16 at 13:31
2 Answers
why is the data not divided into 4 equal blocks of 50MB, so that each of them fits within the 64MB block size?
Because it is configured to store 64 MB blocks by default in the Hadoop configuration. You can change it to 50 MB by changing/adding the dfs.block.size property in hdfs-site.xml.
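The same property can also be set per client instead of cluster-wide in hdfs-site.xml, since the block size for a new file is chosen from the client's configuration at create time. A minimal sketch, assuming a Hadoop client with the cluster configuration on the classpath; the path /tmp/example.dat and the 50 MB value are only placeholders, and newer Hadoop releases spell the property dfs.blocksize:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same knob as the hdfs-site.xml property above, set on the client side;
            // the value is in bytes (50 MB = 52428800).
            conf.setLong("dfs.block.size", 50L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            // Files created by this client are now split into 50 MB blocks.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"))) {
                out.writeBytes("sample data");
            }
            fs.close();
        }
    }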
That said, HDFS is storage for BIG data processing, and the default block size is set higher (64 MB/128 MB) because of:
Storing metadata for these files/blocks in the Namenode: the Namenode has to keep metadata for every block, so more (smaller) blocks mean more metadata in the Namenode.
- e.g., for storing a 1 GB file, the Namenode has to store metadata for 16 blocks of 64 MB vs. 21 blocks of 50 MB (the arithmetic is sketched at the end of this answer)
Network overhead when processing files; Hadoop performs better with bigger files:
- e.g., with a transfer rate of 1 MB/s and 10% overhead:
- 3 blocks of 64 MB and 1 block of 8 MB take about 218 sec to transfer over the network
- 4 blocks of 50 MB take about 220 sec
This 200 MB example is very small in the big data world, where TBs of data get processed in parallel.
Also NOTE: when the last block holds only 8 MB (as in your example), it will occupy only 8 MB of storage and will not consume the full 64 MB block size.
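To make the two points above concrete, here is a small back-of-the-envelope sketch in plain Java (no Hadoop dependency needed). The block counts follow directly from ceiling division; for the transfer times the answer only states the inputs (1 MB/s, 10% overhead), so applying the overhead per block and truncating each block's time to whole seconds is an assumption, chosen because it reproduces the 218 sec and 220 sec figures quoted above:

    public class HdfsBlockMath {
        // Number of blocks the Namenode must track for a file (ceiling division).
        static long blocksNeeded(long fileMb, long blockMb) {
            return (fileMb + blockMb - 1) / blockMb;
        }

        // Sizes of the individual blocks: only the last one can be smaller than
        // blockMb, and it occupies only its actual size (see the NOTE above).
        static long[] blockSplit(long fileMb, long blockMb) {
            int n = (int) blocksNeeded(fileMb, blockMb);
            long[] blocks = new long[n];
            for (int i = 0; i < n; i++) {
                blocks[i] = Math.min(blockMb, fileMb - (long) i * blockMb);
            }
            return blocks;
        }

        // Per-block transfer time at the given rate plus a flat overhead fraction,
        // truncated to whole seconds per block (this rounding is an assumption,
        // but it reproduces the 218 sec / 220 sec figures quoted above).
        static long transferSeconds(long[] blockMbs, double mbPerSec, double overhead) {
            long total = 0;
            for (long mb : blockMbs) {
                total += (long) ((mb / mbPerSec) * (1.0 + overhead));
            }
            return total;
        }

        public static void main(String[] args) {
            // 200 MB file with 64 MB blocks -> [64, 64, 64, 8], as in the question
            System.out.println(java.util.Arrays.toString(blockSplit(200, 64)));

            // Namenode metadata for a 1 GB file: 16 blocks of 64 MB vs. 21 of 50 MB
            System.out.println(blocksNeeded(1024, 64));  // 16
            System.out.println(blocksNeeded(1024, 50));  // 21

            // Transfer over the network at 1 MB/s with 10% overhead
            System.out.println(transferSeconds(blockSplit(200, 64), 1.0, 0.10)); // 218
            System.out.println(transferSeconds(blockSplit(200, 50), 1.0, 0.10)); // 220
        }
    }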

Hadoop operates with large amounts of data and does not like small files. Having small blocks means more mappers get launched and resources get wasted; the NameNode also comes under huge pressure, as it has to keep a reference to the address of every block in your cluster. It would increase the time needed to access the data over the network and would cause a significant performance hit.
64 MB was introduced by the Apache team as the recommended minimum block size, so that it puts reasonable pressure on your NameNode and at the same time allows you to process data in parallel in your MapReduce jobs.
In some Hadoop distributions, like Cloudera's, a 128 MB block size is used by default.
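If you are not sure which default your cluster or distribution is using, you can ask the client directly. A small sketch, assuming a Hadoop 2.x or later client with the cluster configuration on the classpath; the path /user/data is only a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Block size that will be used for new files under this path:
            // 64 MB on older clusters, 128 MB on newer ones and some distributions.
            long blockSize = fs.getDefaultBlockSize(new Path("/user/data"));
            System.out.println("Default block size: " + blockSize / (1024 * 1024) + " MB");
            fs.close();
        }
    }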