
From what I have read, the journal files created by MongoDB are compressed with the Snappy compression algorithm, but I am not able to decompress one of these journal files. Trying to decompress it gives the following error:

Error stream missing snappy identifier

The Python code I used to decompress it is as follows:

import io
import snappy

try:
    # Open in binary mode: the journal is binary data, not text
    with open('journal/WiredTigerLog.0000000011', 'rb') as f:
        fh = io.BytesIO()
        # Expects framed (streaming) Snappy data starting at offset zero
        snappy.stream_decompress(f, fh)
        print(fh.getvalue())
except Exception as e:
    print(str(e))

Please help; I can't get any further than this.

stackMonk
  • Maybe your journal isn't compressed. Try to open it in a hex-editor and see if you can read your plain data. – Reto Aebersold Feb 15 '17 at 17:11
  • Ditto what @RetoAebersold said. [It seems to not be finding the expected Snappy header](https://github.com/andrix/python-snappy/blob/master/snappy.py#L213). – Dan Feb 17 '17 at 21:01
  • Tried your code snippet and it worked on framed snappy data. Adding to what others noted, if you open the file in a hex editor, it should be apparent whether it's snappy framed data. The signature (starting at file offset zero) is `\377\006\0\0sNaPpY` as given in the *nix magic file, or `ff06 0000 734e 6150 7059` in hex. Perhaps the WiredTiger Storage Engine is writing using a [different compression](https://docs.mongodb.com/manual/core/wiredtiger/#compression) option? – Lex Scarisbrick Feb 19 '17 at 16:58

1 Answer


There are two forms of Snappy compression: the basic form and the streaming form. The basic form has the limitation that all of the data must fit in memory, so the streaming form exists to compress larger amounts of data. The streaming format has a header followed by subranges that are compressed. If the header is missing, it sounds like the data may have been compressed with the basic form while you are trying to decompress it with the streaming form. See https://github.com/andrix/python-snappy/issues/40

If that is the case, use `decompress` instead of `stream_decompress`.
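A minimal sketch of how that choice could be automated, assuming framed data always begins with the stream identifier `ff 06 00 00 73 4e 61 50 70 59` quoted in the comments. `looks_like_framed_snappy` and `decompress_any` are hypothetical helper names; the only python-snappy calls used are `snappy.decompress` and `snappy.stream_decompress`.

```python
import io

# Stream identifier chunk that opens a framed Snappy stream:
# chunk type 0xff, 3-byte little-endian length 6, then ASCII "sNaPpY".
SNAPPY_STREAM_ID = b"\xff\x06\x00\x00sNaPpY"


def looks_like_framed_snappy(data):
    """Return True if data starts with the Snappy framing identifier."""
    return data.startswith(SNAPPY_STREAM_ID)


def decompress_any(data):
    """Hypothetical helper: pick the python-snappy call from the header."""
    # Imported lazily so the header check works without python-snappy.
    import snappy

    if looks_like_framed_snappy(data):
        out = io.BytesIO()
        snappy.stream_decompress(io.BytesIO(data), out)
        return out.getvalue()
    # No stream identifier: assume a single basic-form Snappy block.
    # This raises an error if the data is not Snappy at all.
    return snappy.decompress(data)
```

If `decompress_any` fails on the whole file as well, that suggests the file as a whole is not a single Snappy payload, which is consistent with the possibility discussed next.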

But it could be that the data isn't compressed at all:

# Binary mode, since the journal is not a text file
with open('journal/WiredTigerLog.0000000011', 'rb') as f:
    for line in f:
        print(line)

could work.
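As a rough stand-in for the hex editor suggested in the comments, the following sketch prints the first bytes of a file as hex alongside their printable ASCII, so the Snappy signature or readable plain data is easy to spot. `hex_preview` is a hypothetical helper that uses only the standard library.

```python
def hex_preview(path, length=64, width=16):
    """Return a hex/ASCII dump of the first `length` bytes of a file."""
    with open(path, "rb") as f:
        data = f.read(length)
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Two hex digits per byte, separated by spaces
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        # Printable ASCII as-is, everything else as "."
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("{:08x}  {:<{w}}  {}".format(
            offset, hex_part, text_part, w=width * 3 - 1))
    return "\n".join(lines)
```

Printing `hex_preview('journal/WiredTigerLog.0000000011')` shows whether the file starts with `ff 06 00 00 73 4e 61 50 70 59`; if it does not, the file is not a framed Snappy stream.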

The minimum log record size for WiredTiger is 128 bytes. If a log record is 128 bytes or smaller, WiredTiger does not compress that record. See https://docs.mongodb.com/manual/core/journaling/

Hugues Fontenelle
  • Since WiredTiger only compresses records that are larger than 128 bytes, how can we detect which lines are compressed and which are not? – Cybersupernova Feb 20 '17 at 14:53