1

I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 MB.

If I try to decompress it as a single object, I only get the first stream (about 19 KB).

I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:

import zlib

outfile = open('output.xml', 'w')

def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()

zipstreams('payload')

This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!

Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.

Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write-repeat bottleneck that I'm stuck with?

Thanks for any pointers or suggestions you have!

jscarto
    You did see that the source answer says "It's good enough for small files and to figure out the format"? It's a really inefficient way to do it; among other things, you're making tens of thousands of really large substrings of `data`. Find a higher-level way to process each stream. – alexis May 12 '13 at 11:06
  • It [sounds](http://stackoverflow.com/q/15694270/699305) like you're just setting up this process with the suppliers of the data; can't you just ask them to send you the data collection as a single zipped file, e.g., as a gzipped tar archive of a folder containing the individual streams? They will get _dramatically_ better compression, so it's worth everyone's time to make the change. – alexis May 12 '13 at 11:15
  • My guess is that the problem is that, for skipping over headers or whatever else is between the streams, you're doing `i += 1` and trying again. So, along with decompressing 10000 files, you're failing to decompress 990000 files, or something like that. To see whether this is true, try logging something each time you get that `zlib.error`. – abarnert May 12 '13 at 11:16
  • And if that _is_ the problem, in order to solve it, print out the offsets and lengths of each run of errors, then look at the file in a hex editor to see what's there. Ideally, there will be something you can parse to get lengths or delimiters, so you can break the file up into a bunch of separate zlib streams and just decompress those properly. – abarnert May 12 '13 at 11:18
  • Following up on alexis's point, even if you can't get them to send you the data as a zip/tgz/whatever, maybe you can at least get them to describe the format they're using, instead of trying to reverse-engineer it? – abarnert May 12 '13 at 11:19
  • @alexis This is how the data are supplied. We've tried to get them in other formats but from the supplier's POV it was "take it leave it," unfortunately. – jscarto May 12 '13 at 11:22
  • @abarnert The whole file had a single header of 32 bytes. I've removed it from the compressed object prior to running this code on it. Would it be more efficient to create a list of each stream and its position, and then iterate over that list to do the decompression? – jscarto May 12 '13 at 11:22
  • Do you have an upper bound on the length of the individual documents? Extracting `data[i:i+max]` at each step should be much faster than using `data[i:]`. – alexis May 12 '13 at 11:26
  • Actually, the big question is: Are your streams encapsulated, or do you really have raw concatenated zlib streams? I'm guessing there's some sort of encapsulation (gzip, if you're lucky), which should allow you to detect the start and end of each stream without any decompression. – alexis May 12 '13 at 11:29
  • I believe they are encapsulated. The vendor describes them as "using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built." If this is indeed the case, would the idea to make a list of stream positions and decompress each position then be the route to go? – jscarto May 12 '13 at 11:33
  • Additional note: I've tried to treat the compressed object as a single encapsulated file with the standard gzip tool but it does not recognize the format (which is strange considering the vendor's description). – jscarto May 12 '13 at 11:38
  • @unutbu, I'm afraid you're confusing the ZIP _library_ format with the raw `zlib` encoding. See http://tools.ietf.org/html/rfc1950 for the correct specs. – alexis May 12 '13 at 12:05
  • @jscarto: The question is whether, in addition to the 32-byte header at the start of the file, there's _also_ a header at the start of each (subsequent) stream. If so, you can use that to make things more efficient. More importantly, if you _can_ "create a list of each stream and its position", yes, you definitely should… but without some kind of per-stream header, I have no idea how you could do that (without doing all the work you're currently doing). – abarnert May 12 '13 at 12:11
  • @jscarto: Also, can you ask the vendor any questions, or is that one-liner description all the info you can get? Because, again, it's easier to parse a format someone describes for you, than to reverse engineer the format from examples. Also, have you tried logging the offset of each `zlib.error` as I asked, to see if my guess is in the right direction? – abarnert May 12 '13 at 12:12
  • @abarnert A log of the zlib.error reveals many `Error -3 while decompressing: incorrect header check` and occasional `Error -3 while decompressing: unknown compression method` and `Error 2 while decompressing` statements. The vendor supplied a detailed PDF explaining the data, but the format we received is atypical. The normal routine would be decompressing a single zip file consisting of a header+payload, where the payload is a much smaller packet (for real-time or continuous queries). They've given us 24 hours' worth of data by simply concatenating many of these individual streams. – jscarto May 12 '13 at 12:38

3 Answers

4

It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic -- not a good thing when you've got tens of thousands of them. So let's see what we know:

You say that the vendor states that they are

using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.

From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950

So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).

Python's zlib module provides a Decompress object (returned by zlib.decompressobj()) that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:

The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.

So, this is what you can do: Write a loop that reads through data, say, one block at a time (no need to even read the entire 800 MB file into memory). Push each block to the Decompress object, and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
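A minimal sketch of that loop (untested; the function name and block size are just placeholders), under the assumption that the streams really are back-to-back after the 32-byte file header:

import zlib

def decompress_concatenated(inname, outname, blocksize=2**16):
    """Sketch: decompress back-to-back zlib streams, reading block by block."""
    with open(inname, 'rb') as source, open(outname, 'wb') as sink:
        source.read(32)                    # skip the 32-byte file header
        decomp = zlib.decompressobj()
        while True:
            block = source.read(blocksize)
            if not block:
                break                      # EOF
            while block:
                sink.write(decomp.decompress(block))
                block = decomp.unused_data
                if block:                  # this stream ended inside the block
                    decomp = zlib.decompressobj()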

Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.

import zlib

# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF+FLG  
    
def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""

    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
        
    offset = buf.find(header)
    return buf[offset:]

You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:

datafile = 'payload'                 # the compressed input file from the question
outfile = open('output.xml', 'w')    # single output file, as in the question

source = open(datafile, 'rb')
source.read(32)  # Skip the 32-byte non-zlib header at the start of the file

buf = ''
block = ''  # most recent block read from source
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    if buf == '':
        break  # EOF: no further stream header found
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue

    # Read until zlib decides it's seen a complete stream
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF

    outfile.write(stream)
    buf = decomp.unused_data  # Save the leftover bytes for the next stream
    if buf == '' and len(block) == 0:
        break  # EOF

outfile.close()

PS 1. If I were you I'd write each XML stream into a separate file.
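A possible sketch for PS 1 (the naming scheme is just an illustration): open a fresh, numbered file whenever a new stream starts, instead of writing everything to the single `outfile`.

stream_num = 0

def next_outfile():
    """Open a numbered file for the next decompressed XML stream."""
    global stream_num
    stream_num += 1
    return open('stream_%05d.xml' % stream_num, 'wb')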

PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.

alexis
  • +1. Note that you'll want to break out of *both* `while` loops when we reach EOF. – unutbu May 12 '13 at 12:54
  • Thanks for great explanation of what this does/why it should work. I really appreciate it! I've tried using a similar approach, going through the file in chunks, but with both of our methods, the script fails and won't run due to an `incorrect header check.` – jscarto May 12 '13 at 12:56
  • Sorry to hear that, you should have given a more precise description of the actual format of your input. Then you need to combine this approach with detecting the start of the next zlib stream. Given the way the data were generated, they almost certainly start with the same CMF/FLG header (see the zlib doc), so you can search for that and try there (but you can't rule out false matches). – alexis May 12 '13 at 13:02
  • You don't need to assume a fixed zlib header. You can check for two bytes that meet the RFC 1950 header specification, which includes a 5-bit checksum. You can check that the two bytes taken as a big-endian integer is a multiple of 31, and also check that the low four bits of the first byte is 8. If that all checks out, you still need to run the decompressor on it to make sure that it's not a false positive. Far better would be to figure out the format. It's probably not very hard. – Mark Adler May 13 '13 at 00:05
  • This has gotten me super close (the answer is 100% correct; not getting it running at optimally has to do with my own unfamiliarity with zlib headers). After looking at my data in a hex editor as @abarnert suggested, the 2 bytes differ for each stream. So I think if I modify it a bit to check for two bytes that meet the RFC 1950 spec it should go through pretty quickly. It's already much faster than anything I had before by a factor of at least 10. I'm going to mark this as the correct answer as it contains the proper procedure and a wonderful description. Thank you so much for your help! – jscarto May 13 '13 at 08:27
  • Going from eight days down to a minute (which is where you should be) is more than four orders of magnitude. So you have another three orders of magnitude to go. – Mark Adler May 14 '13 at 19:11
  • There's mystery meat between the streams, so clearly they're not just concatenated together. But we don't even know if it's four bytes or four KB. He's got to either put up some sample data for us, or solve the problem himself, or just let the program churn overnight and move on to something else... – alexis May 14 '13 at 19:39
1

Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
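As a sketch of that header check (following RFC 1950 and the rules described in the comments above; a match is only a candidate worth trying, never a guarantee):

def looks_like_zlib_header(cmf, flg):
    """True if two bytes (given as ints) could start a zlib stream per
    RFC 1950: compression method 8 (deflate) and CMF*256 + FLG a
    multiple of 31."""
    return (cmf & 0x0f) == 8 and ((cmf << 8) + flg) % 31 == 0

# Scanning a Python 2 str buffer for candidate stream starts:
# candidates = [i for i in xrange(len(buf) - 1)
#               if looks_like_zlib_header(ord(buf[i]), ord(buf[i + 1]))]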

More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.

If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?

The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams either by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of each compressed stream (as in the example you found).

Mark Adler
0

From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately. Which means each one has a 32-byte header you need to skip.

If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)

So, try just skipping 32 bytes after each stream finishes.

Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
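A sketch of that (untested, and it only works if the 32-byte-header-per-stream guess is right), written as a minimal change to the code in the question:

import zlib

def zipstreams(filename, outname='output.xml'):
    """Decompress concatenated zlib streams, assuming each one is preceded
    by a 32-byte header that can simply be skipped over."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    out = open(outname, 'w')
    i = 32                               # skip the header before the first stream
    while i < len(data):
        decomp = zlib.decompressobj()
        # decompress() stops at the end of the current stream and puts the
        # rest of the input in unused_data; it raises zlib.error if the
        # 32-byte guess is wrong and we're not actually at a stream start
        out.write(decomp.decompress(data[i:]))
        i = len(data) - len(decomp.unused_data) + 32
    out.close()

Note that the data[i:] slice still copies the remaining tail of the file on every iteration, so the block-at-a-time reading in the accepted answer is still preferable; this only illustrates the header-skipping idea by itself.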

abarnert