
Please clarify: 1) What is the difference between a chunk, a block, and a file split in Hadoop? 2) What is the internal process of the hadoop fs -put command?

Marek Grzenkowicz

2 Answers


Block: HDFS talks in terms of blocks. For example, if you have a 256 MB file and your configured block size is 128 MB, then 2 blocks get created for that file.
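For example, you can inspect how a file was broken into blocks with the fsck tool (the path below is just a hypothetical example):

    # Hypothetical path; lists the blocks that make up the file
    hdfs fsck /user/hadoop/big.log -files -blocks
    # For a 256 MB file with a 128 MB block size, the report shows 2 blocks.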

Block size is configurable across the cluster and can even be overridden on a per-file basis.
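As a sketch, assuming your version's fs shell honors generic -D options: the cluster-wide default lives in dfs.blocksize (hdfs-site.xml), and you can override it for a single file at write time.

    # Override the block size for just this file (64 MB here); path is hypothetical
    hadoop fs -D dfs.blocksize=67108864 -put big.log /user/hadoop/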

Split: This is a MapReduce concept. You have the option to change the split size, meaning you can make your split size greater than or less than your block size. By default, if you don't configure anything, the split size is approximately equal to the block size.

In MapReduce processing, the number of mappers spawned equals the number of splits: if a file has 10 splits, then 10 mappers are spawned.
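As a hedged sketch (the property names are the Hadoop 2.x ones; the jar and driver names are hypothetical, and the driver must use ToolRunner for -D to be picked up), you can push the split size below the block size when submitting a job, which in turn raises the number of mappers:

    # Ask for ~64 MB splits instead of the 128 MB block size
    hadoop jar my-job.jar com.example.MyDriver \
        -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
        /user/hadoop/big.log /user/hadoop/out
    # Split size is roughly max(minsize, min(maxsize, blocksize)),
    # so a 256 MB input now yields ~4 splits and ~4 mappers instead of 2.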

When the put command is fired, the request goes to the namenode. The client (in this case the hadoop fs utility behaves like a client) breaks the file into blocks according to the block size, which can be defined in hdfs-site.xml, and the namenode then tells the client which data nodes to write the different blocks to.
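If you want to check which block size the client will actually use (it reads it from the configuration, e.g. hdfs-site.xml), one way, assuming the standard hdfs getconf tool, is:

    # Print the effective default block size in bytes (e.g. 134217728 = 128 MB)
    hdfs getconf -confKey dfs.blocksize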

The actual data gets stored on the data nodes, while the metadata, i.e. the file's block locations and file attributes, is stored on the name node.

The client first establishes a connection with the name node; once it gets confirmation about where to store a block, it makes a direct TCP connection to the data nodes and writes the data.
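After the write completes, you can see which data nodes each block ended up on; a sketch with a hypothetical path:

    # -locations adds the datanode addresses holding each block replica
    hdfs fsck /user/hadoop/big.log -files -blocks -locations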

Based on the replication factor, additional copies are maintained in the Hadoop cluster, and their block information is also stored on the namenode.
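For example (hypothetical path; assumes the standard stat and setrep shell commands), you can read or change the replication factor of an existing file:

    # Show the current replication factor of the file
    hadoop fs -stat %r /user/hadoop/big.log
    # Change it to 2 and wait (-w) until re-replication finishes
    hadoop fs -setrep -w 2 /user/hadoop/big.log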

But in no scenario will a data node hold duplicate copies of a block, meaning the same block is never replicated onto the same node.

user3484461

A chunk, a block, and a file split are all referring to the same thing: HDFS splits the file by block size (usually 128 or 256 MB), and the blocks themselves are replicated a configurable number of times (usually 3).

As for the put command, ultimately you are creating a pipeline: for each block, the NameNode tells the client which DataNode to copy it to, and that DN then copies it to a friend, which in turn copies it to a friend. There's a small write-up in the "Replication Pipelining" section of https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. There's a visual on slide 14 of http://www.slideshare.net/lestermartin/hadoop-demystified as well.

Lester Martin
  • HDFS blocks and MapReduce splits are closely related but not the same - refer to the Definitive Guide section quoted here http://stackoverflow.com/q/14291170/95 - records do not necessarily fit nicely into blocks. – Marek Grzenkowicz Apr 10 '15 at 22:05