
How can HDFS have a sequential block of 64 MB when the underlying Linux filesystem has only 4 KB blocks? Doesn't that mean a 64 MB block write cannot be sequential?

Any thoughts on this? I have not been able to find an explanation.

sethi

1 Answer


You may be confusing the terms "contiguous" and "sequential". We have sequential reads/writes (from/to disk) and "contiguous" disk space allocation.

A single 64 MB HDFS block is written to disk sequentially. There is therefore a fair chance that the data ends up in contiguous space on disk (many 4 KB filesystem blocks laid out next to each other), so disk/block fragmentation will be much lower than with random disk writes.
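To make this concrete, here is a minimal Java sketch of that write pattern (the class name and the /tmp path are made up for illustration). A DataNode stores each HDFS block as an ordinary file on its local filesystem and appends to it in order, so the 64 MB block arrives at the local filesystem as one long sequential stream spread over many 4 KB blocks:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class SequentialBlockWrite {
    public static void main(String[] args) throws IOException {
        // Hypothetical path; a DataNode keeps each HDFS block as a plain
        // local file (named blk_<id>) in its data directory.
        String blockFile = "/tmp/blk_1073741825";

        byte[] buffer = new byte[4 * 1024];        // 4 KB, a typical ext4 block size
        long blockSize = 64L * 1024 * 1024;        // one 64 MB HDFS block

        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream(blockFile))) {
            long written = 0;
            while (written < blockSize) {
                // Appending in order: the filesystem sees one sequential stream
                // and can allocate mostly contiguous 4 KB blocks for it.
                out.write(buffer);
                written += buffer.length;
            }
        }
    }
}
```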

Furthermore, sequential reads/writes are much faster than random ones, which involve multiple disk seeks. See Difference between sequential write and random write for further information.
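If you want to see the difference yourself, below is a rough microbenchmark sketch (the file paths, the 256 MB test size and the 4 KB chunk size are arbitrary choices for illustration, not anything dictated by HDFS). It writes the same amount of data once sequentially and once at random offsets via seek(). On a spinning disk the random variant is typically much slower; on an SSD, or when everything fits in the page cache, the gap shrinks.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class SeqVsRandomWrite {
    static final int CHUNK = 4 * 1024;                 // 4 KB per write
    static final long FILE_SIZE = 256L * 1024 * 1024;  // 256 MB test file

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[CHUNK];
        long seqMs = timeSequential("/tmp/seq.dat", data);
        long rndMs = timeRandom("/tmp/rnd.dat", data);
        System.out.printf("sequential: %d ms, random: %d ms%n", seqMs, rndMs);
    }

    static long timeSequential(String path, byte[] data) throws IOException {
        long start = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            // File pointer only ever moves forward: one long sequential write.
            for (long pos = 0; pos < FILE_SIZE; pos += CHUNK) {
                f.write(data);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    static long timeRandom(String path, byte[] data) throws IOException {
        Random rng = new Random(42);
        long chunks = FILE_SIZE / CHUNK;
        long start = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            f.setLength(FILE_SIZE);                    // pre-size so seeks land inside the file
            for (long i = 0; i < chunks; i++) {
                // Jump to a random 4 KB slot before each write.
                f.seek((long) (rng.nextDouble() * chunks) * CHUNK);
                f.write(data);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```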

harpun
  • @harpun Thank you. Does sequential always mean contiguous? I can't see how it could be, but to quote the Hadoop Operations book: "Increasing the block size means data will be written in larger contiguous chunks on disk, which in turn means data can be written and read in larger sequential operations." One last doubt: since the disk head is shared across processes, a sequential write from one process may still become a random write because another process may move the disk head elsewhere. – sethi Jan 29 '14 at 07:27
  • @sethi: sequential disk writes lead to contiguous blocks of data, exactly as your book says. As for multi-process writes: disk writes are cached at the software level (operating system) and at the hardware level (disk cache). Furthermore, disk writes are optimized so that the head does not seek back and forth between locations while writing; hard disks buffer writes in order to minimize seeks in favor of sequential writes and reduced head movement. – harpun Jan 29 '14 at 19:24