
Please clarify: 1) What is the difference between a chunk, a block, and a file split in Hadoop? 2) What is the internal process of the hadoop fs -put command?

Marek Grzenkowicz

2 Answers


Block: HDFS talks in terms of blocks. For example, if you have a 256 MB file and your configured block size is 128 MB, then 2 blocks get created for that file.
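For example, you can inspect how a file was broken into blocks with the fsck tool (the path below is just a hypothetical example):

    # Hypothetical path; lists the blocks that make up the file
    hdfs fsck /user/hadoop/big.log -files -blocks
    # For a 256 MB file with a 128 MB block size, the report shows 2 blocks.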

Block size is configurable across the cluster and can even be overridden on a per-file basis.
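As a sketch, assuming your version's fs shell honors generic -D options: the cluster-wide default lives in dfs.blocksize (hdfs-site.xml), and you can override it for a single file at write time.

    # Override the block size for just this file (64 MB here); path is hypothetical
    hadoop fs -D dfs.blocksize=67108864 -put big.log /user/hadoop/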

Split: This is a MapReduce concept. You have the option to change the split size, meaning you can make your split size greater than or less than your block size. By default, if you don't configure anything, the split size is approximately equal to the block size.

In MapReduce processing, the number of mappers spawned equals the number of splits: if a file has 10 splits, then 10 mappers are spawned.
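As a hedged sketch (the property names are the Hadoop 2.x ones; the jar and driver names are hypothetical, and the driver must use ToolRunner for -D to be picked up), you can push the split size below the block size when submitting a job, which in turn raises the number of mappers:

    # Ask for ~64 MB splits instead of the 128 MB block size
    hadoop jar my-job.jar com.example.MyDriver \
        -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
        /user/hadoop/big.log /user/hadoop/out
    # Split size is roughly max(minsize, min(maxsize, blocksize)),
    # so a 256 MB input now yields ~4 splits and ~4 mappers instead of 2.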

When the put command is fired, the request goes to the namenode. The client (in this case the hadoop fs utility behaves like a client) breaks the file into blocks according to the block size, which can be defined in hdfs-site.xml, and the namenode then tells the client which data nodes to write the different blocks to.
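If you want to check which block size the client will actually use (it reads it from the configuration, e.g. hdfs-site.xml), one way, assuming the standard hdfs getconf tool, is:

    # Print the effective default block size in bytes (e.g. 134217728 = 128 MB)
    hdfs getconf -confKey dfs.blocksize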

The actual data gets stored on the data nodes, while the metadata, i.e. the file's block locations and file attributes, is stored on the name node.

The client first establishes a connection with the name node; once it gets confirmation about where to store a block, it makes a direct TCP connection to the data nodes and writes the data.
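After the write completes, you can see which data nodes each block ended up on; a sketch with a hypothetical path:

    # -locations adds the datanode addresses holding each block replica
    hdfs fsck /user/hadoop/big.log -files -blocks -locations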

Based on the replication factor, additional copies are maintained in the Hadoop cluster, and their block information is also stored on the namenode.
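For example (hypothetical path; assumes the standard stat and setrep shell commands), you can read or change the replication factor of an existing file:

    # Show the current replication factor of the file
    hadoop fs -stat %r /user/hadoop/big.log
    # Change it to 2 and wait (-w) until re-replication finishes
    hadoop fs -setrep -w 2 /user/hadoop/big.log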

But in no scenario will a data node hold duplicate copies of a block, meaning the same block is never replicated onto the same node.

user3484461

A chunk, a block, and a file split are all referring to the same thing: HDFS splits the file by block size (usually 128 or 256 MB), and the blocks themselves are replicated a configurable number of times (usually 3).

As for the put command, ultimately you are creating a pipeline: for each block, the NameNode tells the client which DataNode to copy it to, and that DN then copies it to a friend, which in turn copies it to a friend. There's a small write-up in the "Replication Pipelining" section of https://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. There's a visual on slide 14 of http://www.slideshare.net/lestermartin/hadoop-demystified as well.

Lester Martin
  • HDFS blocks and MapReduce splits are closely related but not the same - refer to the Definitive Guide section quoted here http://stackoverflow.com/q/14291170/95 - records do not necessarily fit nicely into blocks. – Marek Grzenkowicz Apr 10 '15 at 22:05