This is a follow-up to the question: data block size in HDFS, why 64MB?

I know that the block size in HDFS is the same across all DataNodes in a cluster (the size itself depends on configuration).

My question is: why is this block size kept consistent across all the DataNodes?

I am asking because, say I have 10 higher-end processing machines as DataNodes and another 20 on lower-end hardware. If we keep larger blocks in HDFS on those 10 machines, could they process data faster? Also, the NameNode already holds the metadata to identify the blocks on each DataNode, so what is the problem with inconsistent block sizes among machines?

  • Your last sentence is a true statement, did you intend that to be a question? – OneCricketeer Jun 10 '16 at 08:31
  • Yes, that is related to my question. Can you please elaborate on why the decision was made to have a consistent block size on all the DataNodes? – Vishrant Jun 10 '16 at 08:33
  • The `dfs.blocksize` value is a cluster-wide setting from `hdfs-site`, as far as I know. While you *can* put individual files with custom blocksizes, I'm not aware of any mechanism for "balancing" blocks across higher-end machines – OneCricketeer Jun 10 '16 at 08:36
  • @cricket_007 I came to know that we can only have the cluster-wide setting with `dfs.blocksize`, not a custom block size for an individual DataNode. If you have any idea about custom block sizes, please elaborate – Vishrant Jun 10 '16 at 08:41
  • I have only seen it done when you `hdfs dfs -put` a file, which does not specify which DataNode to place the file on, nor should it. For reference: http://stackoverflow.com/questions/2669800/changing-the-block-size-of-a-dfs-file-in-hadoop – OneCricketeer Jun 10 '16 at 08:44
  • @cricket_007 After reading that question, one more scenario comes to mind. Let's say we keep a block size of 64MB, but after processing the data we realize that this block size is not efficient. Is there any way to change the block size, or will we have to copy all the data again after configuring some other value? – Vishrant Jun 10 '16 at 08:52
  • That would appear to also be answered. http://stackoverflow.com/questions/29604823/change-block-size-of-existing-files-in-hadoop – OneCricketeer Jun 10 '16 at 08:59
  • @cricket_007 Thanks a lot, that cleared up my last doubt, but my primary question remains the same: why is the block size in HDFS consistent across all the DataNodes? – Vishrant Jun 10 '16 at 09:15
  • Because it is a cluster-wide setting and there exists no mechanism or configuration that manages individual DataNodes/NameNodes? I thought we went over that – OneCricketeer Jun 10 '16 at 09:18

1 Answer

let's say I have 10 higher-end processing machines as DataNodes and another 20 on lower-end hardware. If we keep larger blocks in HDFS on those 10 machines, could they process data faster?

Short Answer

An HDFS block is the basic unit of data parallelism in Hadoop, i.e. one HDFS block is processed by one CPU core. Having different block sizes (64 MB, 128 MB, 256 MB, etc.) for the same file depending on the processing power of the DataNode will not help, because each HDFS block is still processed by a single core. Even the more powerful machines tend to have more CPU cores rather than faster ones (core clock speeds have plateaued at roughly 2.5 to 3.5 GHz over the last decade).

For certain files (or file types like Parquet) that are much denser, it makes sense to use larger block sizes. But it certainly doesn't make sense to split one file into HDFS blocks of varying sizes depending on which DataNode they land on. This is probably why the Hadoop designers decided on consistent block sizes.
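
As a rough illustration of that point (the file and block sizes below are assumed numbers, not anything from the question), the parallelism a framework gets out of a file is driven by how many blocks the file has, while each block is still handled by one core:

```java
// Rough sketch with assumed numbers: parallelism is (roughly) file size / block size.
public class BlockParallelism {
    public static void main(String[] args) {
        long fileSize  = 10L * 1024 * 1024 * 1024;   // 10 GB input file (assumed)
        long blockSize = 128L * 1024 * 1024;         // 128 MB cluster-wide dfs.blocksize (assumed)

        long tasks = (fileSize + blockSize - 1) / blockSize;  // one map task / partition per block
        System.out.println(tasks);                   // 80 tasks, each processed by a single core

        // Doubling the block size on a "big" node would halve the task count for the data
        // stored there, but each (now larger) task would still run on one core.
    }
}
```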


Long Answer

You mentioned higher-end processing machines. Nowadays, a faster machine usually means a CPU with more cores rather than a CPU with a higher clock speed (GHz). Clock speeds have been near their limit for quite some time (almost a decade), peaking at around 2.5 to 3.5 GHz.

In the frameworks that run on HDFS (e.g. MapReduce, Spark and others), one HDFS block is processed by one CPU core. So the bigger blocks would still be processed by a single core within those bigger machines, which would make those tasks run much slower.

Even on the higher-end processing machines, the per-core processing power will be about the same as on normal nodes. Storing larger blocks on nodes with more cores will not help, because the individual cores in those boxes are similar in speed to those in the smaller/normal nodes.

Besides, there are other reasons why the Hadoop designers would have decided against it...

Specifying the block size is allowed as a cluster-wide setting, as @cricket_007 mentioned, and it can also be overridden on a per-file basis using `dfs.blocksize`.
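
As a minimal sketch of that per-file override using the HDFS Java API (the path, block size, replication and buffer size below are illustrative assumptions, not values from the question):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSizeWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster-wide dfs.blocksize from hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        long blockSize    = 256L * 1024 * 1024;     // 256 MB: overrides the default for this one file only
        short replication = 3;
        int bufferSize    = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/data/example-file"), true, bufferSize, replication, blockSize)) {
            out.writeBytes("example contents");
        }
    }
}
```

As the linked questions describe, a similar override can also be passed on the command line (e.g. `-D dfs.blocksize=...` when putting or copying the file). Either way, the override applies to a file, never to a particular DataNode.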

The following may be some of the driving factors behind why, for any one file, all blocks are of a consistent size.

  1. Simplify Configuration - how would you specify a block size per DataNode per file? Perhaps nodes with 2x the cores of normal nodes should get 2x the block size, and so on. This would make configuration very hard.
  2. Avoid data skew - having some blocks larger than others would introduce data skew, which directly affects how the data processing frameworks handle those files (whose block sizes would vary depending on the node).
  3. Simplify replication - imagine the cluster-wide replication factor is configured to be 3, so each block needs 3 copies in total. If block sizes depended on the DataNode's size (compute power), there would have to be at least as many nodes of similar computing power as the replication factor. If there were only 3 big nodes and 10 normal nodes, all big blocks would have to live on the big nodes.
  4. Simplify fail-over - imagine one of the big nodes going down: Hadoop would not be able to find another big node on which to copy those extra-big blocks to keep up with the replication factor (we only had 3 big nodes, and one of them is gone). If it instead copied those big blocks to normal nodes, it would re-introduce skew between processing power and block size and hurt the performance of data processing jobs. The other alternative, splitting the big blocks up when moving them to normal nodes, is again extra complexity. (A small sketch of this arithmetic follows the list.)
  5. Get predictable performance - skew in the data makes it very hard to get predictable performance.
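
To make points 3 and 4 concrete, here is a small sketch of the replica-placement arithmetic, using the assumed counts from the example above (3 big nodes, replication factor 3):

```java
// Sketch of the replica-placement arithmetic from points 3 and 4 (assumed counts, not real config).
public class BigBlockPlacement {
    public static void main(String[] args) {
        int replicationFactor = 3;
        int bigNodesAlive     = 3;   // only the big nodes could hold a "big" block

        // Every replica of a big block would need its own big node:
        System.out.println(bigNodesAlive >= replicationFactor);  // true, but with zero headroom

        bigNodesAlive = 2;           // one big node goes down
        System.out.println(bigNodesAlive >= replicationFactor);  // false: the block stays under-replicated,
        // unless its replicas are copied to normal nodes, re-introducing the
        // block-size vs. processing-power skew described above.
    }
}
```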

These are possibly some of the reasons: supporting per-node block sizes would introduce too many complexities, and hence the feature is not supported.

  • You are making too many assumptions for any useful answer to his vague question. CPU frequency does not matter for all tasks, some tasks are bandwidth or latency limited. And BTW, you can process one block with multiple threads (see multithreaded mappers for example). – Thomas Jungblut Jun 10 '16 at 12:06