
HDFS is a logical filesystem in Hadoop with a block size of 64 MB. A file on HDFS is stored on the underlying OS filesystem, say ext4 with a 4 KiB block size.

To my knowledge, for a file on the local filesystem, the OS uses the start and end cylinders of the 4 KiB blocks on the physical hard disk to retrieve it. Since HDFS files are also stored on the underlying ext4 filesystem, they too must be retrieved via the start and end cylinders of those same 4 KiB blocks.

If that is the case, HDFS by itself would not increase the speed of data retrieval. So the question is: what technique does HDFS use, with respect to the hard disk, to increase its retrieval speed?
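To make the two layers concrete, here is a minimal sketch using the Hadoop FileSystem Java API (the path is only an example I made up): it prints the block size HDFS records for a file, which is the 64 MB logical size, not the 4 KiB ext4 block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: print the logical block size HDFS records for one file.
    // The path below is just an example; any existing HDFS file will do.
    public class ShowBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            // HDFS (logical) block size, e.g. 67108864 bytes = 64 MB,
            // not the 4 KiB block size of the underlying ext4 filesystem.
            System.out.println("HDFS block size: " + status.getBlockSize() + " bytes");
            fs.close();
        }
    }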

Shashikanth Komandoor

1 Answer


You are correct that the retrieval speed from the ext filesystem itself is not changed. What happens instead is that a large file is split into pieces of, say, 64 MB, which are stored on different machines. When a retrieval call is made, multiple machines read their pieces of the file simultaneously and stream them back to the reader (the NameNode only tells the client where each piece lives). That is where the speed-up comes from. It is the same as ten men finishing a building task in one day rather than one man taking ten days.
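To make the "pieces on different machines" point concrete, here is a minimal sketch using the Hadoop FileSystem Java API (the file path is only an example). It lists which hosts hold each block of a file, which is exactly the information a client or a framework like MapReduce uses to read or process the blocks in parallel.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.Arrays;

    // Sketch: show how an HDFS file is split into blocks and which
    // DataNodes hold each block, so reads can proceed in parallel.
    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/big-file.log");   // example path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }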

Abhishek
  • But what exactly does a 64 MB block mean with respect to the hard disk? Does a 64 MB HDFS block mean that the first 64 MB piece of an HDFS file is saved on contiguous extents of the hard disk? Or does it mean that until a 64 MB piece of an HDFS file has been saved on a DataNode, it will not shift to another DataNode? How exactly is the 64 MB block calculated in an HDFS filesystem? – Shashikanth Komandoor Sep 19 '14 at 04:33
  • I'll give you the explanation from Hadoop: The Definitive Guide. Filesystem blocks are an integral multiple of the disk block size, which is normally 512 bytes. Filesystems typically have blocks of a few KiB; in HDFS this unit is 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. – Abhishek Sep 19 '14 at 08:07
  • Exactly, a file will be broken into blocks of the default size. But if the file is undersized, it will not occupy the whole 64 MB; what is visible on the DataNode is a block file of 64 MB or less. As for why the block size is so large, it is to reduce the cost of seeks, which would otherwise take up a larger share of the read time if we used smaller blocks. – Abhishek Sep 19 '14 at 08:18
  • It should also be noted that HDFS is not an actual filesystem; it uses API access to the underlying filesystem. Yahoo uses ext3 as the base filesystem for its Hadoop deployments. – Abhishek Sep 19 '14 at 08:33
  • Sorry Abhishek, I am still confused. In Hadoop we have the flexibility to set the block size to 64 MB, 128 MB, or 256 MB depending on the requirement. But what exactly happens in the backend when a bigger block size is set? For any of those values the disk block size stays the same, right? So what actually changes when a different HDFS block size is set? – Shashikanth Komandoor Sep 19 '14 at 09:24
  • The purpose here is to overcome the limitations of traditional filesystems, which have block sizes of 4-8 KB. But in the end it all has to be stored on the disk, and that storage is based on the disk block size, like you said. By choosing a large filesystem block size, HDFS assumes that the applications relying on it will perform long sequential streaming reads (see the sketch below these comments). I think this will clear your confusion: http://stackoverflow.com/questions/19473772/data-block-size-in-hdfs-why-64mb – Abhishek Sep 19 '14 at 10:04
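To illustrate the comment thread above, here is a minimal sketch using the Hadoop FileSystem Java API (the path, the 128 MB size, and the replication factor are only example values). The HDFS block size is simply per-file metadata chosen at write time; each block still ends up as an ordinary file on the DataNode's local ext3/ext4 filesystem, which keeps using its own 4 KiB blocks underneath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file with a non-default HDFS block size (128 MB).
    // Changing this only changes how the file is chunked and placed across
    // DataNodes; the disk block size of the underlying filesystem is untouched.
    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            long blockSize = 128L * 1024 * 1024;   // 128 MB instead of the 64 MB default
            short replication = 3;                 // example replication factor
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(new Path("/data/out.txt"),
                    true, bufferSize, replication, blockSize);
            out.writeUTF("hello hdfs");
            out.close();
            fs.close();
        }
    }

The same default can also be changed cluster-wide by setting dfs.blocksize (dfs.block.size in older releases) in hdfs-site.xml.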