
mapreduce in java - gzip input files

I'm running a Hadoop job on a bunch of gzipped input files. Hadoop should handle this easily...

Unfortunately, in my case, the input files don't have a .gz extension. I'm using `CombineTextInputFormat`, which runs my job fine if I point it at non-gzipped files, but I just get a bunch of garbage if I point it at the gzipped ones.

I've searched for quite some time, but the only thing I've turned up is somebody else asking the same question as I have, with no answer... How to force Hadoop to unzip inputs regardless of their extension?

Anybody got anything?
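For what it's worth, the gzip format itself is self-describing (it starts with magic bytes), so plain Java can decompress a file without a .gz extension just fine; the garbage comes from Hadoop's extension-based codec lookup, not from the data. A quick stdlib-only illustration (file name chosen arbitrarily):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipNoExtension {
    public static void main(String[] args) throws Exception {
        // Gzip some text into a file whose name has no .gz suffix
        Path file = Files.createTempFile("part-00000", "");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write("hello from a gzipped file".getBytes(StandardCharsets.UTF_8));
        }

        // GZIPInputStream recognizes gzip from the stream contents,
        // not from the file name, so this reads the text back cleanly:
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)),
                StandardCharsets.UTF_8))) {
            System.out.println(in.readLine()); // hello from a gzipped file
        }
        Files.delete(file);
    }
}
```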

John Chrysostom
  • Have a look at : http://stackoverflow.com/questions/33331366/hadoop-input-split-for-a-compressed-block/33331823#33331823 – Ravindra babu Oct 28 '15 at 04:42

2 Answers


Went digging in the source and built a solution for this...

You need to modify the source of the `LineRecordReader` class to change how it chooses a compression codec. The default version creates a Hadoop `CompressionCodecFactory` and calls `getCodec`, which parses a file path for its extension. You can instead use `getCodecByClassName` to obtain any codec you want.

You'll then need to override your input format class so it uses your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
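A minimal sketch of the wiring, assuming hypothetical class names `ForcedGzipTextInputFormat` and `ForcedGzipLineRecordReader` (the latter being your copied-and-modified `LineRecordReader`); this requires the Hadoop client libraries on the classpath and is an outline of the approach, not a drop-in implementation:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format that substitutes the modified record reader.
public class ForcedGzipTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // ForcedGzipLineRecordReader is a copy of Hadoop's LineRecordReader
        // whose initialize() swaps the extension-based lookup
        //     CompressionCodec codec =
        //         new CompressionCodecFactory(conf).getCodec(file);
        // for an explicit choice that ignores the file name:
        //     CompressionCodec codec = new CompressionCodecFactory(conf)
        //         .getCodecByClassName(GzipCodec.class.getName());
        return new ForcedGzipLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // gzip streams cannot be split mid-file
    }
}
```

Returning `false` from `isSplitable` matters here: without the .gz extension, Hadoop would otherwise assume the files are splittable plain text.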

John Chrysostom

First, gzip files are not splittable, so your MapReduce job cannot split them along block boundaries.

MapReduce only knows to skip splitting (and to decompress) when it sees the .gz file extension. Sadly, in your case the extension is missing, so I am afraid MapReduce cannot tell how to handle the data.

So even if there were a way to force the codec without the extension, you would not get good performance. It may be better to uncompress the data first and then provide it to MapReduce, rather than forcing MapReduce to use a compressed format with reduced performance.

Ramzy
  • Yes, I'm aware of this. The files are small, so my input format is actually combining them rather than splitting them. So, this won't be an issue. – John Chrysostom Oct 28 '15 at 13:31
  • So you are combining files which do not have an extension. That seems to be another concept, as we need to confirm how `CombineTextInputFormat` handles files without an extension. Can you try without it, using the regular `TextInputFormat`, and check? That will have a performance hit, but anyway, even with .gz we are not gaining any performance since it's not splittable. Even I am browsing about how MapReduce knows which technique to use when we do not give an extension – Ramzy Oct 28 '15 at 14:51
  • Yes, I've tried with the standard `TextInputFormat` and no dice. After a ton of digging, it looks like `LineRecordReader` uses the `CompressionCodecFactory` to assign codecs based on filename extensions. Looks like I'm going to have to extend the `LineRecordReader` class, override the `initialize` method, and then also extend all of the input format classes to use my custom `LineRecordReader`. I wish there were a better way, but it looks like there isn't. I'll post the code when I'm done for others' edification. – John Chrysostom Oct 28 '15 at 18:48
  • Thanks for the information. Even I would be waiting for your post. – Ramzy Oct 28 '15 at 18:52