
I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.

I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.

I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.

Any help would be appreciated.

Anhermon

1 Answer


Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.

So all you have to do is write the logic as you would for a plain text file and pass in the directory containing the .gz files as input.
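
For instance, here is a minimal word-count sketch (the class name and input/output paths are placeholders, not from the question); note that there is no decompression code anywhere in it:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzWordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Lines arrive already decompressed: Hadoop picked the GzipCodec
            // based on the .gz extension of each input file.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gz word count");
        job.setJarByClass(GzWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Point the job at the directory holding the .gz files.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run it with something like `hadoop jar myjob.jar GzWordCount /input/gz-folder /output` (jar name and paths made up); each .gz file is fed to its mapper already decompressed.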

But the issue with gzip files is that they are not splittable: if you have, say, 5 GB gzip files, then each mapper will process a whole 5 GB file instead of working with the default block size.
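
If you want to verify which codec Hadoop will pick for a given file, and whether that codec is splittable, you can ask the `CompressionCodecFactory` directly. A small sketch (the file names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        // The same extension-based lookup Hadoop performs for each input file.
        CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
        for (String name : new String[] {"data.gz", "data.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> " + codec.getClass().getSimpleName()
                    + ", splittable: " + splittable);
        }
    }
}
```

`GzipCodec` does not implement `SplittableCompressionCodec`, while `BZip2Codec` does, which is why a large .gz file ends up in a single map task.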

Ashrith
  • Thanks, the only part I was missing is that the codec needs to be defined in the job, like this: `hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");`. Also, I understand the issue you brought up, but all of the data we have is compressed to .gz, so I guess I'll have to accept it. – Anhermon Oct 26 '14 at 20:34
  • For reading the input gzip files, you don't have to configure anything from the properties perspective in the driver class. But to compress the output from a MapReduce job you have to specify a couple of properties: `mapreduce.output.fileoutputformat.compress`, which specifies whether to compress the MapReduce output, and another property to specify which compression codec to use: `mapreduce.output.fileoutputformat.compress.codec` (see the sketch after these comments). – Ashrith Oct 26 '14 at 20:42
  • That's true, although you wouldn't get multiple mappers if the table is stored as gzip; it has to be a "splittable" format. – Tagar Nov 10 '15 at 02:41
  • Is LZO compression splittable? – Neethu Lalitha Apr 06 '17 at 19:01
  • http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ – Ashrith Apr 07 '17 at 18:50
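
To illustrate the two output-compression properties mentioned in the comments above, here is a minimal sketch of a driver-side helper (the class and method names are made up) that enables gzip output via the `FileOutputFormat` convenience methods, which set those same properties under the hood:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutput {
    // Hypothetical helper: returns a Job configured to gzip its final output.
    // Reading gzip input needs no configuration at all.
    public static Job newGzipOutputJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "gzip-output job");
        // Equivalent to mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);
        // Equivalent to mapreduce.output.fileoutputformat.compress.codec=
        //   org.apache.hadoop.io.compress.GzipCodec
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}
```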