How to use Snappy in Hadoop in Container format

Question

I have to use Snappy to compress both the map output and the final map-reduce output. The output should also be splittable.

From what I have read online, to make Snappy produce splittable output, it has to be used inside a container-like format.

Can you suggest how to go about it? I tried to find examples online but could not find one. I am using Hadoop v0.20.203.

Thanks. Piyush

score 5 · Answer 1 · answered Apr 25 '12 at 05:10

For the job output:

conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");

For the map output:

Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");

root1982


Thanks. However, I am not using the Sequence file format; I am writing with a BufferedWriter. Can you suggest how to do it in that case? – Piyush Kansal Apr 25 '12 at 07:19
"One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce." (http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/) – root1982 Apr 25 '12 at 20:40
The data we are going to compress using Snappy will not be passed on to any further MapReduce job; it will just stay on disk. We only want to use it for compression and to measure the difference between Gzip and Snappy in terms of compression ratio and execution time. So it is okay with me even if it is not splittable. – Piyush Kansal Apr 27 '12 at 00:07
I think it should be OK then. – root1982 May 03 '12 at 01:38
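
For the Gzip versus Snappy comparison discussed in the comments above, a small timing harness may help. The following is only a minimal sketch and not part of the original answer. It assumes Java 8 or later, and the names `CompressionBenchmark`, `StreamWrapper`, and `measure` are hypothetical. Only Gzip from the JDK is shown, since Snappy is not in the standard library; a Snappy stream could presumably be plugged in the same way, for example through a Hadoop codec's `createOutputStream`, assuming the native Snappy library is installed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionBenchmark {

    /** Hypothetical helper: wraps a raw stream in a compressing stream. */
    public interface StreamWrapper {
        OutputStream wrap(OutputStream raw) throws IOException;
    }

    /** Compresses the input once and returns {compressedBytes, elapsedNanos}. */
    public static long[] measure(byte[] input, StreamWrapper wrapper) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        // Closing the compressing stream flushes any buffered data into the sink.
        try (OutputStream out = wrapper.wrap(sink)) {
            out.write(input);
        }
        long elapsed = System.nanoTime() - start;
        return new long[] { sink.size(), elapsed };
    }

    public static void main(String[] args) throws IOException {
        // Synthetic, repetitive sample data; real measurements should use real data.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("key").append(i % 50).append("\tvalue\n");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.UTF_8);

        long[] gzip = measure(input, GZIPOutputStream::new);
        System.out.printf("gzip: %d -> %d bytes (ratio %.2f) in %d us%n",
                input.length, gzip[0], (double) input.length / gzip[0], gzip[1] / 1000);
    }
}
```

A single run like this is noisy; repeating the measurement several times and discarding the first few runs should give steadier numbers.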

score 1 · Answer 2 · answered Mar 03 '15 at 12:18

In the new API, the OutputFormat is set on the Job rather than on the configuration, so the first part becomes:

Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);

// new Job(conf) copies conf, so set the codec on the job's own configuration
job.getConfiguration().set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");
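
Combining the settings from both answers, a complete new-API driver might look roughly like the sketch below. This is an illustration under assumptions, not the answerers' code: the class name `SnappySequenceFileDriver`, the job name, the reliance on the default identity mapper and reducer, and the key/value types are all hypothetical choices. It also assumes a Hadoop build that ships `SnappyCodec` with the native Snappy library available, and it uses `FileOutputFormat.setOutputCompressorClass` in place of setting the codec property by string.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver class; args[0] is the input path, args[1] the output path.
public class SnappySequenceFileDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map-output compression is set before the Job is created,
        // because the Job takes its own copy of the configuration.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "snappy-sequencefile-example");
        job.setJarByClass(SnappySequenceFileDriver.class);

        // With the default text input and identity map/reduce,
        // keys are byte offsets and values are lines of text.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Block-compressed SequenceFile output using Snappy.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Block compression inside a SequenceFile is what keeps the output splittable, since the container records sync points between compressed blocks.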
