
I've recently been looking into Hadoop and HDFS. When you load a file into HDFS, it will normally split the file into 64MB chunks and distribute those chunks around your cluster. Except it can't do this with gzip'd files, because a gzip'd file can't be split. I completely understand why this is the case (I don't need anyone explaining why a gzip'd file can't be split up). But why couldn't HDFS take a plain text file as input and split it like normal, then compress each split using gzip separately? When any split is accessed, it's just decompressed on the fly.
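
To sketch what I mean (purely illustrative and not Hadoop code; the chunk size, buffer size, and file naming here are arbitrary), the idea is simply to split first and then gzip each piece on its own:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch: cut a plain text file into ~64MB pieces and gzip each piece
    // independently, so any single piece can be decompressed on its own.
    public class SplitThenGzip {
        static final long SPLIT_SIZE = 64L * 1024 * 1024; // roughly one HDFS block

        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[8192];
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                int part = 0;
                int read;
                while ((read = in.read(buf)) != -1) {
                    long written = 0;
                    // every piece is its own complete, standalone gzip stream
                    try (GZIPOutputStream out = new GZIPOutputStream(
                            new FileOutputStream(args[0] + ".part" + part++ + ".gz"))) {
                        do {
                            out.write(buf, 0, read);
                            written += read;
                        } while (written < SPLIT_SIZE && (read = in.read(buf)) != -1);
                    }
                }
            }
        }
    }

Each .partN.gz produced this way is a complete, standalone gzip stream, so any one of them can be decompressed without touching the others.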

In my scenario, each split would be compressed completely independently. There would be no dependencies between splits, so you wouldn't need the entire original file to decompress any one of them. Note that this is not the approach taken by the patch at https://issues.apache.org/jira/browse/HADOOP-7076, and that patch is not what I'd want.

This seems pretty basic... what am I missing? Why couldn't this be done? Or if it could be done, why haven't the Hadoop developers gone down this route? It seems strange given how much discussion I've found about people wanting splittable gzip'd files in HDFS.

onlynone
  • I just wanted to add a comment to this question. What I'm thinking of would be exactly like what git does with its objects in the object store. Every single blob, commit, and tree object is zlib-compressed just before being saved to disk, regardless of what the actual object contains, and no tools that work 'above' the git plumbing need to know anything about the compression format. – onlynone Feb 24 '17 at 21:57

2 Answers


The simple reason is the design principle of "separation of concerns".

If you did what you propose, then HDFS would have to know what the actual bits and bytes of the file mean, and it would also have to be able to reason about them (i.e. extract, decompress, etc.). In general you don't want this kind of mixing of responsibilities in software.

So the only part that has to understand what the bits mean is the application that reads the data, which is commonly written using the MapReduce part of Hadoop.
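
To make that concrete, here is a rough sketch (the class name is made up, and a map-only identity job is used just for illustration) of compression being handled entirely at the MapReduce layer: gzip'd input is detected by file extension, the output codec comes from the job configuration, and HDFS itself only ever stores opaque bytes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only identity job: all compression decisions live at this layer.
    public class CompressionAwareJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compression above HDFS");
            job.setJarByClass(CompressionAwareJob.class);
            job.setMapperClass(Mapper.class);   // identity mapper
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // TextInputFormat (the default) transparently decompresses *.gz input
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // The job, not HDFS, decides whether and how the output is compressed
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }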

As stated in the Javadoc of HADOOP-7076 (I wrote that thing ;) ): always remember that there are alternative approaches.

HTH

Niels Basjes
  • Hadoop wouldn't have to know what the bits mean any more than it has to for splittable bzip2. I'm simply talking about how the data is stored after the splits have been done. So, split the file into 67108864-byte chunks (without knowing anything about those bits and bytes) and then compress each block right before storage. I guess I'm thinking of it more as a storage back-end format than an actual file format. That way, absolutely any compression algorithm could be used. – onlynone Jun 30 '11 at 19:15
  • Also, bzip2 isn't really splittable until Hadoop 0.21, which isn't stable. And who knows when 0.22 will be released. – onlynone Jun 30 '11 at 19:22
  • The 64MB chunks at the HDFS level only determine how these files are placed on the datanodes of the cluster. Also, if you have gzipped files that are bigger than that, it will have a negative impact on the performance of jobs over those gzipped files. – Niels Basjes Jul 01 '11 at 08:24
  • HDFS doesn't know what the bits and bytes of the file mean either. To HDFS there is not a lot of difference between an HDFS block and an HDFS file; there are some files which reside in only one block. – rogue-one Jun 11 '18 at 18:21

HDFS has the limited scope of being only a distributed file-system service and doesn't do heavy-lifting operations such as compressing data. The actual process of data compression is delegated to distributed execution frameworks like MapReduce, Spark, Tez, etc. So compression of data/files is the concern of the execution framework and not of the file system.

Additionally, the presence of container file formats like SequenceFile, Parquet, etc. removes the need for HDFS to compress data blocks automatically as the question suggests.
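
As a rough sketch (the class name is made up for illustration), this is what that looks like with a block-compressed SequenceFile: the container format compresses blocks of records itself and remains splittable, so HDFS never needs to know about compression at all:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    // Sketch: write a block-compressed SequenceFile; the container format,
    // not HDFS, handles the compression and the file stays splittable.
    public class BlockCompressedSequenceFileWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(args[0])),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK,
                            ReflectionUtils.newInstance(GzipCodec.class, conf)))) {
                writer.append(new Text("some key"), new Text("some value"));
            }
        }
    }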

So, to summarize: for design-philosophy reasons, any compression of data must be done by the execution engine, not by the file-system service.

rogue-one
  • I guess to implement my idea, you could simply use a file system that supports transparent compression for all the volumes that the datanodes use to store the blocks. – onlynone Jun 15 '18 at 17:01
  • You mean hardware-based compression? Not including this feature is likely a conscious decision by the HDFS implementers; they may include it in the future. – rogue-one Jun 15 '18 at 20:04
  • Not hardware compression. A number of filesystems like ZFS can do transparent compression. – onlynone Jun 20 '18 at 14:45