Say I have 10 higher-end processing machines as DataNodes and another 20 machines with lower-end hardware. If we keep larger HDFS blocks on those 10 machines, can it process faster?
Short Answer
An HDFS block is the basic unit of data parallelism in Hadoop: one HDFS block is processed by one task, which runs on one CPU core. Having different block sizes (64 MB, 128 MB, 256 MB, etc.) for the same file depending on the processing power of the DataNode will not help, because each block is still processed by one core. The more powerful machines will typically have more CPU cores rather than faster CPU cores (core clock speeds have maxed out at around 2.5 to 3.5 GHz over the last decade).
For certain files (or file types like Parquet) that are much denser, it makes sense to have larger block sizes. But it certainly does not make sense to split a single file into HDFS blocks of varying sizes based on the DataNode they land on. This is probably why the Hadoop designers decided on consistent block sizes.
Long Answer
You mentioned higher-end processing machines. Nowadays, a faster machine means a CPU with more cores rather than a CPU with a higher clock speed (GHz). Clock speeds have been close to their limit for quite some time (almost a decade), peaking at around 2.5 to 3.5 GHz.
In the frameworks that run on HDFS (e.g. MapReduce, Spark and others), one HDFS block is processed by one task on one CPU core. So the bigger blocks would still each be processed by a single core on those bigger machines, which would make those tasks run much slower.
Even on the higher-end processing machines, the per-core processing power will be about the same as on the normal nodes. Storing larger blocks on nodes with more cores will not help, because the individual cores in those boxes are no faster than the cores in the smaller/normal nodes.
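To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. All the numbers in it (file size, core count, per-core throughput) are made-up assumptions for illustration, not measurements. It estimates how many tasks a file produces at different block sizes and how long the resulting waves of tasks take: larger blocks just mean fewer, longer tasks, and once there are fewer tasks than cores, the extra cores sit idle while each block still crawls through a single core.

```java
// Back-of-the-envelope model: one HDFS block -> one task -> one core.
// All numbers below are illustrative assumptions, not benchmarks.
public class BlockSizeSketch {

    // Seconds to process one block, assuming a fixed per-core rate of ~100 MB/s.
    static double secondsPerBlock(long blockSizeMb) {
        double mbPerSecondPerCore = 100.0;   // assumed per-core throughput
        return blockSizeMb / mbPerSecondPerCore;
    }

    static void estimate(long fileSizeMb, long blockSizeMb, int totalCores) {
        long tasks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceil(file / block)
        long waves = (tasks + totalCores - 1) / totalCores;          // ceil(tasks / cores)
        double wallClockSeconds = waves * secondsPerBlock(blockSizeMb);
        System.out.printf("block=%4d MB -> %4d tasks, %2d wave(s), ~%5.1f s%n",
                blockSizeMb, tasks, waves, wallClockSeconds);
    }

    public static void main(String[] args) {
        long fileSizeMb = 10_240;     // a 10 GB file (assumed)
        int totalCores = 40;          // e.g. 10 "big" nodes with 4 usable cores each (assumed)
        for (long blockMb : new long[]{128, 256, 512}) {
            estimate(fileSizeMb, blockMb, totalCores);
        }
    }
}
```

With these example numbers, 128 MB and 256 MB blocks finish in the same wall-clock time, while 512 MB blocks are slower: only 20 tasks exist for 40 cores, so half the cores do nothing and each remaining task takes twice as long.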
Besides, there are some other reasons why the Hadoop designers would have decided against it...
As @cricket_007 mentioned, the block size can be specified as a cluster-wide setting, and it can also be overridden on a per-file basis using dfs.blocksize.
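For completeness, this is roughly how a per-file block size can be set programmatically. It is a sketch using the standard org.apache.hadoop.fs.FileSystem API; the path, replication factor and sizes are placeholders, and the same effect can usually be had from the shell with a generic option such as `-D dfs.blocksize=268435456` on `hdfs dfs -put`.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml

        // Option 1: override dfs.blocksize for everything written through this Configuration.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/example-256mb-blocks.txt");   // placeholder path

        // Option 2: pass an explicit block size for just this one file.
        // Signature: create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.write("hello".getBytes(StandardCharsets.UTF_8));
        }

        // Confirm the block size that was actually recorded for the file.
        System.out.println("block size used: " + fs.getFileStatus(path).getBlockSize());
    }
}
```

Note that even this per-file override applies one block size to the whole file; there is no knob to vary the block size within a file depending on which DataNode a block lands on.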
The following are likely some of the driving factors behind keeping all blocks of one file the same size.
- Simplify configuration - How would you specify a block size per DataNode per file? Perhaps nodes with twice the cores of normal nodes should get twice the block size, and so on. This would make configuration very hard.
- Avoid data skew - Having some blocks larger than others would introduce data skew, which directly affects how data processing frameworks handle those files (whose block sizes would vary by node).
- Simplify replication - Imagine the cluster-wide replication factor is configured to be 3, so each block needs 3 copies in total. If block sizes depended on DataNode size (compute power), you would need at least as many nodes of similar computing power as the replication factor. If there were only 3 big nodes and 10 normal nodes, every big block would have to live on all 3 big nodes.
- Simplify fail-over - Now imagine one of the big nodes goes down. Hadoop would not be able to find another big node on which to copy those extra-big blocks to keep up with the replication factor (we only had 3 big nodes, and one of them is gone). If it copies the big blocks to normal nodes instead, it introduces skew between processing power and block size and hurts the performance of data processing jobs. The other alternative, splitting the big blocks up when moving them to normal nodes, is again extra complexity. (See the toy sketch after this list.)
- Get predictable performance - Skew in the data makes it very hard to get predictable performance.
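As a toy illustration of the replication and fail-over points above (all node counts and the replication factor are made-up example values), the constraint HDFS would effectively need is that the number of nodes eligible to hold a block is at least the replication factor; with per-node-class block sizes, that pool shrinks to just the "big" nodes:

```java
// Toy illustration of the replication / fail-over argument above.
// Node counts and the replication factor are made-up example values.
public class ReplicationSketch {

    // A block can only be fully replicated if enough eligible nodes exist.
    static boolean canReplicate(int eligibleNodes, int replicationFactor) {
        return eligibleNodes >= replicationFactor;
    }

    public static void main(String[] args) {
        int replicationFactor = 3;
        int bigNodes = 3;      // nodes allowed to hold the over-sized blocks
        int normalNodes = 10;  // nodes holding regular-sized blocks

        // Regular blocks: any of the 13 nodes is eligible -> plenty of headroom.
        System.out.println("normal blocks replicable: "
                + canReplicate(bigNodes + normalNodes, replicationFactor));

        // Over-sized blocks: only the big nodes are eligible -> exactly enough, no slack.
        System.out.println("big blocks replicable:    "
                + canReplicate(bigNodes, replicationFactor));

        // One big node fails: re-replication of its big blocks has nowhere to go
        // without spilling onto normal nodes (skew) or re-splitting the blocks.
        System.out.println("big blocks after one failure: "
                + canReplicate(bigNodes - 1, replicationFactor));
    }
}
```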
These are possibly some of the reasons: supporting it would introduce too much complexity, and hence this feature is not supported.