I have to use Snappy to compress both the map output and the final MapReduce job output. The output should also be splittable.

From what I have read online, Snappy output is only splittable when it is written inside a container format.

Can you please suggest how to go about it? I tried to find examples online but could not find one. I am using Hadoop v0.20.203.

Thanks. Piyush

Piyush Kansal

2 Answers

For the job output:

conf.setOutputFormat(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(conf, true);
conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

For the map output:

Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

root1982
  • Thanks. However, I am not using the SequenceFile format; I am writing with a BufferedWriter. Can you suggest how to do it in that case? – Piyush Kansal Apr 25 '12 at 07:19
  • "One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce." (http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/) – root1982 Apr 25 '12 at 20:40
  • The data we compress with Snappy will not be passed to any further MapReduce job; it will just stay on disk. We only want to measure the difference between Gzip and Snappy in terms of compression ratio and execution time, so it is okay with me even if the output is not splittable. – Piyush Kansal Apr 27 '12 at 00:07
  • I think it should be OK then. – root1982 May 03 '12 at 01:38
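
For the BufferedWriter case discussed in the comments above, a rough sketch of the measurement might look like the following. It is only an illustration under some assumptions: the JDK does not ship Snappy, so GZIPOutputStream stands in as the compressing stream, and the names CompressionBenchmark, StreamWrapper, Result and measure are made up for this example. With Hadoop on the classpath, the wrapper would presumably return the stream obtained from the codec's createOutputStream(OutputStream) instead. The point is that BufferedWriter simply sits on top of whatever compressing stream is used, so the writing code stays the same.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionBenchmark {

    /** Hypothetical hook: wraps a raw sink in a compressing stream, so codecs can be swapped. */
    public interface StreamWrapper {
        OutputStream wrap(OutputStream raw) throws IOException;
    }

    /** Outcome of one run: bytes before and after compression, plus elapsed time. */
    public static final class Result {
        public final long rawBytes;
        public final long compressedBytes;
        public final long nanos;

        Result(long rawBytes, long compressedBytes, long nanos) {
            this.rawBytes = rawBytes;
            this.compressedBytes = compressedBytes;
            this.nanos = nanos;
        }

        /** Compressed size divided by raw size; smaller means better compression. */
        public double ratio() {
            return (double) compressedBytes / rawBytes;
        }
    }

    /** Writes the same line `repeats` times through a BufferedWriter and records size and time. */
    public static Result measure(String line, int repeats, StreamWrapper wrapper) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long bytesPerLine = line.getBytes(StandardCharsets.UTF_8).length
                + System.lineSeparator().getBytes(StandardCharsets.UTF_8).length;
        long start = System.nanoTime();
        // The writer is layered over the compressing stream; closing it flushes and finishes the stream.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(wrapper.wrap(sink), StandardCharsets.UTF_8))) {
            for (int i = 0; i < repeats; i++) {
                writer.write(line);
                writer.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long nanos = System.nanoTime() - start;
        return new Result(bytesPerLine * repeats, sink.size(), nanos);
    }

    public static void main(String[] args) {
        String line = "key\tvalue with some repeated text";
        Result plain = measure(line, 10000, raw -> raw);          // no compression, as a baseline
        Result gzip = measure(line, 10000, GZIPOutputStream::new); // stand-in for a Snappy stream
        System.out.printf("plain: %d bytes, %d ns%n", plain.compressedBytes, plain.nanos);
        System.out.printf("gzip:  %d bytes, ratio %.4f, %d ns%n",
                gzip.compressedBytes, gzip.ratio(), gzip.nanos);
    }
}
```

A single run like this is only indicative; for a fair comparison of execution time, the same real data should be written several times per codec and the first runs discarded as warm-up.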
In the new API, the output format is set on the Job rather than on the configuration. The first part then becomes:

Job job = new Job(conf);
...
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
SequenceFileOutputFormat.setCompressOutput(job, true);

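// Job copies the Configuration when it is constructed, so set the codec on the job's own copy.
job.getConfiguration().set("mapred.output.compression.codec","org.apache.hadoop.io.compress.SnappyCodec");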
VeLKerr