19

According to this Cloudera post, Snappy IS splittable.

For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data.

But from Hadoop: The Definitive Guide, Snappy is NOT splittable (see the book's table of compression formats).

There is also conflicting information on the web. Some say it's splittable, some say it's not.

moon
  • 1,702
  • 3
  • 19
  • 35
  • Noticed the same thing, interestingly it seems that Cloudera is WRONG. – koders Sep 18 '16 at 15:50
  • 1
they changed the docs http://www.cloudera.com/documentation/enterprise/latest/topics/admin_data_compression_performance.html so it is splittable, but only with container formats – mishkin Oct 30 '16 at 18:12

4 Answers

33

Both are correct, but at different levels.

According to the Cloudera blog http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/:

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

This means that if a whole text file is compressed with Snappy, then the file is NOT splittable. But if each record inside the file is compressed with Snappy, then the file could be splittable, for example in a SequenceFile with block compression (sketched after the diagrams below).

To be clearer, this:

<START-FILE>
  <START-SNAPPY-BLOCK>
     FULL CONTENT
  <END-SNAPPY-BLOCK>
<END-FILE>

is not the same as this:

<START-FILE>
  <START-SNAPPY-BLOCK1>
     RECORD1
  <END-SNAPPY-BLOCK1>
  <START-SNAPPY-BLOCK2>
     RECORD2
  <END-SNAPPY-BLOCK2>
  <START-SNAPPY-BLOCK3>
     RECORD3
  <END-SNAPPY-BLOCK3>
<END-FILE>

Snappy blocks are NOT splittable, but files made of Snappy blocks are splittable.
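
As a rough sketch of that second layout (Hadoop 2.x SequenceFile API; the path and record contents are purely illustrative), a SequenceFile written with BLOCK compression and SnappyCodec compresses groups of records into separate Snappy blocks, so the resulting file stays splittable:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, SequenceFile, Text}
  import org.apache.hadoop.io.SequenceFile.CompressionType
  import org.apache.hadoop.io.compress.SnappyCodec
  import org.apache.hadoop.util.ReflectionUtils

  val conf = new Configuration()
  // Instantiate the codec through ReflectionUtils so it picks up the configuration.
  val codec = ReflectionUtils.newInstance(classOf[SnappyCodec], conf)

  val writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("/tmp/records.seq")),
    SequenceFile.Writer.keyClass(classOf[LongWritable]),
    SequenceFile.Writer.valueClass(classOf[Text]),
    // BLOCK compression: many records per Snappy-compressed block,
    // with sync markers between blocks that MapReduce can split on.
    SequenceFile.Writer.compression(CompressionType.BLOCK, codec))

  try {
    for (i <- 0L until 1000L) {
      writer.append(new LongWritable(i), new Text(s"record $i"))
    }
  } finally {
    writer.close()
  }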

RojoSam
  • 1,476
  • 12
  • 15
4

All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. Looking at the Hadoop source code as of 2.7, we see that org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so we know it is not splittable.
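
If you want to verify this from code rather than by reading the source, a quick check of the class hierarchy (a sketch, assuming the Hadoop 2.7 jars are on the classpath) shows which codecs declare themselves splittable:

  import org.apache.hadoop.io.compress.{BZip2Codec, GzipCodec, SnappyCodec, SplittableCompressionCodec}

  val splittable = classOf[SplittableCompressionCodec]
  println(splittable.isAssignableFrom(classOf[BZip2Codec]))  // true
  println(splittable.isAssignableFrom(classOf[SnappyCodec])) // false
  println(splittable.isAssignableFrom(classOf[GzipCodec]))   // false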

qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
4

I have just tested this with Spark 1.6.2 on HDFS, with the same number of workers/processors, comparing a plain JSON file with the same data compressed by Snappy:

  • JSON: 4 files of 12 GB each; Spark creates 388 tasks (1 task per HDFS block) (4 × 12 GB / 128 MB ≈ 384)
  • Snappy: 4 files of 3 GB each; Spark creates 4 tasks

The Snappy files were created like this: .saveAsTextFile("/user/qwant/benchmark_file_format/json_snappy", classOf[org.apache.hadoop.io.compress.SnappyCodec])

So Snappy is not splittable with Spark for plain JSON text.

But if you use the Parquet (or ORC) file format instead of JSON, the files will be splittable (even with gzip).
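
For comparison, a minimal sketch of the same write using Parquet from Spark 1.6 (the paths are illustrative; the codec is set explicitly via the SQL configuration):

  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

  val df = sqlContext.read.json("/user/qwant/benchmark_file_format/json")
  df.write.parquet("/user/qwant/benchmark_file_format/parquet_snappy")

  // Reading this back produces many tasks (roughly one per row group / HDFS block),
  // instead of one task per whole Snappy-compressed text file.
  sqlContext.read.parquet("/user/qwant/benchmark_file_format/parquet_snappy").count()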

Thomas Decaux
  • 21,738
  • 2
  • 113
  • 124
2

Snappy is actually not splittable in the way bzip2 is, but when it is used with file formats like Parquet or Avro, instead of compressing the entire file, blocks inside the file format are compressed using Snappy.

To understand what happens when you compress a Parquet file with Snappy, look at the structure of a Parquet file [source link]:

(diagram of the Parquet file layout: row groups, column chunks, and pages)

Inside a Parquet file, records are split into row groups (basically a subset of rows from the original file), and each row group is composed of column chunks (shown in the image). Each column chunk is in turn made up of many pages, where the actual records are stored in encoded, columnar format along with metadata. When you enable Snappy compression, it compresses entire pages, not the entire file, so you effectively get a splittable Parquet file with Snappy compression.

The advantage of Snappy is that it is a very lightweight compression codec.

Note: There are default size limits for row groups and column chunks, 128 MB and 1 MB respectively (you can alter these defaults), and you can use a different compression codec with Parquet, e.g. gzip.
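
If you want to tune these from Spark, a hedged sketch of where the knobs live (the property names come from parquet-mr; the values shown are the defaults mentioned above):

  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")     // or "gzip"
  sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)  // row-group size
  sc.hadoopConfiguration.setInt("parquet.page.size", 1 * 1024 * 1024)     // page size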

MikA
  • 5,184
  • 5
  • 33
  • 42