
I have a file that consists of compressed content plus a 32-byte header. The header contains info such as a timestamp, the compressed size, and the uncompressed size.

The file itself is about 490 MB, and the header claims the uncompressed size is close to 2.7 GB (clearly incorrect, since it also claims the compressed size is 752 MB).

I've stripped the header off, leaving just the compressed payload, which I can decompress with zlib.
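
For reference, this is roughly how I split the header from the payload. The struct layout below is only an illustration based on the fields mentioned above (timestamp, compressed size, uncompressed size), not the provider's documented layout:

import struct

HEADER_SIZE = 32

def split_header(path):
    # Illustration only: pretend the 32-byte header is four little-endian
    # 64-bit fields (timestamp, compressed size, uncompressed size, reserved).
    # The real field order and widths come from the provider's spec.
    with open(path, 'rb') as f:
        header = f.read(HEADER_SIZE)
        payload = f.read()
    timestamp, comp_size, uncomp_size, _reserved = struct.unpack('<4Q', header)
    return (timestamp, comp_size, uncomp_size), payload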

The problem is that it only decompresses to about 19 KB, which is much smaller than 490 MB (the bare minimum it should be; I'm expecting around 700 MB uncompressed).

My code is below:

import zlib

def consume(inputFile):
    content = inputFile.read()
    print "Attempting to process " + str(len(content)) + " bytes..."
    # Decompress the whole payload in one call.
    decompressed = zlib.decompress(content)
    print "Attempting to write " + str(len(decompressed)) + " bytes..."
    # Write the result out in binary mode.
    outfile = open('output.xml', 'wb')
    outfile.write(decompressed)
    outfile.close()

infile = open('payload', 'rb') 

consume(infile)

infile.close()

When run, the program outputs:

Attempting to process 489987232 bytes...
Attempting to write 18602 bytes...

I've tried using zlib.decompressobj(), but that raises an "incorrect header" error. zlib.decompress() works fine and produces the decompressed XML I expect... just far too little of it.
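
This is the sort of thing I tried with the decompress object; the wbits values are just my guesses at the three common framings (zlib, gzip, raw deflate), not anything from the provider's docs:

import zlib

def try_decompress(data):
    # Try each framing in turn; which one this file actually uses is a guess.
    framings = [
        ('zlib header', zlib.MAX_WBITS),
        ('gzip header', zlib.MAX_WBITS | 16),
        ('raw deflate', -zlib.MAX_WBITS),
    ]
    for name, wbits in framings:
        d = zlib.decompressobj(wbits)
        try:
            return name, d.decompress(data) + d.flush()
        except zlib.error:
            continue
    raise zlib.error('none of the expected framings worked')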

Any pointers or suggestions are greatly appreciated!

jscarto
  • Where did the file come from? Can you re-download it, roll back to a previous version, restore from backup, etc. as appropriate? – abarnert Mar 29 '13 at 00:17
  • The file definitely sounds corrupt, from the wildly different descriptions of its contents. – nneonneo Mar 29 '13 at 00:24
  • @abarnert The file was provided by a partnering company (over dropbox). I can try to get another and give it a go. Thanks for the pointers - I'd been assuming my code or methods were incorrect, but if it turns out to be the file that will be a major relief! – jscarto Mar 29 '13 at 00:28
  • @jscarto: Did you try using other tools on it? I don't know whether your file (minus the header) is a gz file, a zip file, a raw zlib file, or something else… but if, e.g., `gzip -dc foo.gz >foo` produced a file that was also 18602 bytes, or some other garbage, that would be a good test that your code was doing the right thing. – abarnert Mar 29 '13 at 00:32
  • @abarnert The provider documents it as a "zlib/gz" file (their report makes the two sound equivalent though I know that is not the case). I've tried using `gzip.open('file', 'rb')` in Python but it failed due to an incorrect header. I initially tried the terminal as you suggest, but it also reports an error (not a gzip file). – jscarto Mar 29 '13 at 00:52
  • Well, you have to knock the custom header off. But I think `tail -c +32 foo | gzip -dc > foo.decompressed` should do that. (Assuming it really is a gz file.) Anyway, since your platform's `gzip` tool is probably built on the exact same `zlib` library as your Python, this wouldn't really prove anything beyond the fact that your code is using zlib properly… but that's what you wanted to verify, right? – abarnert Mar 29 '13 at 01:12
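
For what it's worth, the first couple of bytes after the custom header are usually enough to tell gzip framing from zlib framing. Here's a rough check; the magic values are the standard ones, the rest is guesswork about this particular file:

def sniff_framing(path, header_size=32):
    # Read the first two bytes after the custom header and compare them
    # against the well-known magic values.
    with open(path, 'rb') as f:
        f.seek(header_size)
        magic = f.read(2)
    if magic == '\x1f\x8b':
        return 'gzip'
    if magic[:1] == '\x78':
        # 0x78 is the usual first byte of a zlib header (deflate, 32K window).
        return 'zlib (probably)'
    return 'unknown: maybe raw deflate, or the header is a different size'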

2 Answers


You clearly have a corrupted file.

You won't be able to force zlib to ignore the corruption—and, if you did, you'd most likely get either 700MB of garbage, or some random amount of garbage, or… well, it depends on what the corruption is and where. But the chances that you could get anything useful are pretty slim.

zlib's blocks aren't randomly accessible, delimited, or even byte-aligned; it's very hard to tell when you've reached the next block unless you were able to decode the previous one.

Plus, decompression depends on what came before: back-references can reach up to 32 KB into the already-decompressed output, across block boundaries, so even if you could skip to the next block you'd still be producing garbage unless you got very, very lucky and nothing pointed into the broken part. Even worse, each block can redefine the Huffman trees (or switch block types entirely); misread that, and you're decompressing garbage even if you do get lucky. And it's not just a matter of "skip this string because I don't recognize it": if you can't decode a code, you don't even know how many bits it occupies, so you can't skip over it. Which brings us back to the first point: you can't skip even a single string, much less a whole block.

To understand this better, see RFC 1951, which describes the format used by zlib. Try manually working through a few trivial examples (just a couple strings in the first block, a couple new ones in the second block) to see how easy it is to corrupt them in a way that's hard to undo (unless you know exactly how they were corrupted). It's not impossible (after all, cracking encrypted messages isn't impossible), but I don't believe it could be fully automated, and it's not something you're likely to do for fun.
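
If you'd rather see the effect than work through the bit format by hand, a quick experiment along these lines shows it; the sample text and the byte I flip are arbitrary:

import zlib

original = "hello, deflate " * 1000
compressed = zlib.compress(original)

# Flip one byte somewhere in the middle of the compressed stream.
i = len(compressed) // 2
corrupted = compressed[:i] + chr(ord(compressed[i]) ^ 0xFF) + compressed[i + 1:]

try:
    zlib.decompress(corrupted)
except zlib.error as e:
    # zlib can't resynchronize past the damage: you either get an invalid
    # code/distance error partway through, or a failed Adler-32 check at the
    # end. Either way, nothing after the flipped byte is trustworthy.
    print "decompression failed:", e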

If you've got critical data (and can't just re-download it, roll back to the previous version, restore from backup, etc.), some data recovery services claim to be able to recover corrupted zlib/gz/zip files. I'm guessing this costs an arm and a leg, but it may be the right answer for the right data.

And of course I could be wrong about this not being automatable. There are a bunch of zip recovery tools out there. As far as I know, all they can do with broken zlib streams is skip that file and recover the other files… but maybe some of them have some tricks that work in some cases with broken streams.

abarnert
  • This makes a lot of sense - thanks for the detailed reply and link! The data were provided to us as a test case, so it should be possible to re-acquire and try again (keeping my fingers crossed!). – jscarto Mar 29 '13 at 00:54
  • After a bit of a break, I've identified the problem. It turns out the data aren't corrupted. Instead, the compressed file consists of multiple concatenated streams, so my attempt to decompress everything in one call only read the first stream, producing the tiny 19 KB result. I've since adjusted my code to account for this (a rough sketch of that loop is below), though I've now run into a new problem: the decompression is [glacially slow](http://stackoverflow.com/questions/16506590/python-and-zlib-terribly-slow-decompressing-concatenated-streams). – jscarto May 12 '13 at 10:57
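
A rough sketch of that per-stream loop, assuming each piece is an ordinary zlib stream and reusing the read-everything-at-once approach from the question (file names are placeholders):

import zlib

def decompress_concatenated(in_path, out_path):
    # Each zlib stream ends cleanly; whatever bytes follow it are left in
    # .unused_data, which then becomes the input for the next stream.
    with open(in_path, 'rb') as infile, open(out_path, 'wb') as outfile:
        data = infile.read()
        while data:
            d = zlib.decompressobj()
            outfile.write(d.decompress(data))
            outfile.write(d.flush())
            data = d.unused_data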

You need to check zlib.error to see why it stopped. Why did it stop?

Mark Adler