
HDFS supports storing compressed files, and I know that gzip compression doesn't support splitting. Imagine the file is a gzip-compressed file whose compressed size is 1 GB. Now my question is:

  1. How will this file be stored in HDFS (block size is 64 MB)?

From this link I learned that the gzip format uses DEFLATE to store the compressed data, and that DEFLATE stores data as a series of compressed blocks.

But I couldn't understand it completely and am looking for a broader explanation.

More questions about the gzip-compressed file:

  1. How many blocks will there be for this 1 GB gzip-compressed file?
  2. Will it go to multiple datanodes?
  3. How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?
  4. What is the DEFLATE algorithm?
  5. Which algorithm is applied while reading the gzip-compressed file?

I am looking for a broad and detailed explanation here.

Sandeep Singh
  • A file in a file system does not have to be contiguous on disk, whether the disk is one physical disk, or many disks in a distributed file system. The file system divides the file into blocks, which it stores wherever it decides to store it. When an application requests a file, the file system knows the mapping to the blocks and where the blocks are. It sends an I/O request to retrieve them, then the file system pieces the blocks back into the file. This division of large things is kind of the whole point. A distributed system can pool resources to do things a single system couldn't do alone. – e0k Jan 22 '16 at 18:50

1 Answer


How will this file be stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip file format?

All DFS blocks will be stored on a single datanode. If your block size is 64 MB and the file is 1 GB, a datanode with 16 DFS blocks (1 GB / 64 MB = 15.625, rounded up) will store the 1 GB file.

How many blocks will there be for this 1 GB gzip-compressed file?

1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks (the last block is only partially filled).
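
For reference, here is a minimal sketch (the path /user/sandeep/big.gz is a made-up example, and the default configuration from core-site.xml/hdfs-site.xml is assumed) that computes the expected block count and asks the NameNode which datanodes actually host each block, using the HDFS Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/sandeep/big.gz");   // hypothetical 1 GB gzip file

            FileStatus status = fs.getFileStatus(file);
            long blockSize = status.getBlockSize();          // e.g. 64 MB
            long expectedBlocks = (status.getLen() + blockSize - 1) / blockSize; // ceil(1 GB / 64 MB) = 16
            System.out.println("Expected blocks: " + expectedBlocks);

            // Block-to-datanode mapping maintained by the NameNode
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }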

How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?

Same as for any other file. If the file is splittable, nothing changes. If the file is not splittable, datanodes with the required number of blocks will be identified; in this case, 3 datanodes, each with the 16 DFS blocks.
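
To check (or change) the replication factor recorded for a file, a small sketch against the same hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationInfo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/sandeep/big.gz");   // hypothetical gzip file

            // Replication factor stored for this file (default comes from dfs.replication)
            short replication = fs.getFileStatus(file).getReplication();
            System.out.println("Replication factor: " + replication);

            // Replication can also be adjusted per file after the fact
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }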

What is the DEFLATE algorithm?

DEFLATE is the lossless compression algorithm (a combination of LZ77 and Huffman coding) used inside the gzip format; the same algorithm, run in reverse ("inflate"), is applied when decompressing a gzip file.
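
To make the relationship concrete: gzip is a file format, and the bytes inside it are compressed with DEFLATE. In plain Java, GZIPOutputStream and GZIPInputStream wrap Deflater/Inflater, which implement DEFLATE, so a minimal round trip looks like this (the sample text is arbitrary):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipRoundTrip {
        public static void main(String[] args) throws Exception {
            byte[] original = "some text to compress".getBytes(StandardCharsets.UTF_8);

            // Compress: a gzip header/trailer is written around DEFLATE-compressed data
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzipOut = new GZIPOutputStream(compressed)) {
                gzipOut.write(original);
            }

            // Decompress: the DEFLATE stream is inflated back into the original bytes
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (GZIPInputStream gzipIn =
                         new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = gzipIn.read(buffer)) != -1) {
                    restored.write(buffer, 0, n);
                }
            }
            System.out.println(new String(restored.toByteArray(), StandardCharsets.UTF_8));
        }
    }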

Ravindra babu
  • Thanks for your detailed answer. If the gzip-compressed file gets split while being stored on HDFS, then why is it said that gzip doesn't support splitting, and why is it processed by a single mapper? Can you clarify this too? – Sandeep Singh Jan 23 '16 at 06:43
  • Gzip is not splittable, and hence one mapper will process all 16 blocks of the 1 GB file. – Ravindra babu Jan 23 '16 at 07:34
  • Have a look at : http://stackoverflow.com/questions/5630245/hadoop-gzip-compressed-files – Ravindra babu Jan 23 '16 at 07:52
  • So we can understand that a gzip file is split into a sequential series of blocks while the data is being stored in HDFS, but due to limitations in MapReduce it cannot be processed in parallel, and hence we can say that gzip is not splittable for MapReduce processing. Right? – Sandeep Singh Jan 23 '16 at 08:55
  • Yes, exactly. A single mapper has to process all blocks of the gzip file (see the reading sketch after these comments). – Ravindra babu Jan 23 '16 at 09:10
  • I am trying to find more information on processing gzip files. If you have more references, can you post them here? – Sandeep Singh Jan 23 '16 at 09:35
  • This link may help you: http://www.javased.com/index.php?api=org.apache.hadoop.io.compress.GzipCodec – Ravindra babu Jan 23 '16 at 09:43
  • "All DFS blocks will be stored in single Datanode" - this isn't true unfortunately. HDFS blocks of a multi-block gzip file can be stored on multiple datanodes. They'll indeed will have to be processed by a single mapper, but that mapper will do remote reads to collect all remote blocks before de-compressing. – Jakub Kukul Jul 25 '18 at 08:56