I have written a decoder that parses, decompresses and extracts a single file from a zlib-encoded file downloaded through a urllib2 file-like object. The idea is to use as little memory and disk space as possible, so I am using a reader/writer pattern with the "decoder" in the middle: it uncompresses the data coming from urllib2, feeds it into a cpio subprocess and finally writes the file data to disk:

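from contextlib import closing

with closing(builder.open()) as reader:
    with open(component, "w+b") as writer:
        decoder = Decoder()

        # Read the download in 10 KiB chunks until the stream is exhausted.
        while True:
            data = reader.read(10240)
            if len(data) == 0:
                break

            writer.write(decoder.decode(data))

        # Write whatever the decoder still has buffered.
        final = decoder.flush()
        if final is not None:
            writer.write(final)

        writer.flush()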

The decoder is pretty simple too:

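import zlib

class Decoder(object):
    def __init__(self):
        # Streaming decompressor; expects a zlib header by default.
        self.__zcat = zlib.decompressobj()
        # cpio initialisation

    def decode(self, data_in):
        # Inflate one chunk and hand the result to cpio.
        return self.__consume(self.__zcat.decompress(data_in))

    def __consume(self, zcat_data_in):
        # cpio operations
        return data_out

    def flush(self):
        # Drain any data still buffered in the decompressor.
        return self.__consume(self.__zcat.flush())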

I am seeing an error before anything is even passed to the cpio pipe, so I felt omitting it here was sensible for clarity.

The interesting thing is that, to verify the data could in fact be uncompressed, I wrote the raw data_in passed to decode() to stdout:

def decode(self, data_in):
    sys.stdout.write(data_in)
    return self.__consume(self.__zcat.decompress(data_in))

Then ran:

$ bin/myprog.py 2>/dev/null | zcat - | file -
/dev/stdin: cpio archive

As you can see, zcat was quite happy about the data it was given on stdin and the resultant file is a cpio archive. But the zlib decompress method is reporting:

error: Error -3 while decompressing: incorrect header check
Craig
  • [This answer](http://stackoverflow.com/a/22310760/736937) might be worth reviewing. – jedwards Mar 16 '15 at 22:21
  • I thought it had solved the problem for a second, but I introduced a semantic error. However, having tried a few of the window bit combinations, I can only muster a slightly different error `error: Error -3 while decompressing: invalid block type` – Craig Mar 16 '15 at 22:36
  • I was looking at [this source code](https://hg.python.org/cpython/file/2.7/Lib/gzip.py) for the gzip library. It looks like it reads the gzip header before it calls decompress. This might be the key to getting it to work. I would use this library, but it looks unsafe since it seeks on the input file object, which will not be possible on the urllib2 stream. But I'll explore reading off the gzip header as the data is provided to my method. – Craig Mar 16 '15 at 22:49
  • Interestingly, the gzip.py implementation expects the first two bytes to be `\x1f\x8b`. However, my file starts with `\x1f\x9d`. – Craig Mar 16 '15 at 22:59
  • That [magic suggests](http://www.garykessler.net/library/file_sigs.html) ([another ref](http://en.wikipedia.org/wiki/List_of_file_signatures)) it's a Tar.Z -- you might consider [tarfile](https://docs.python.org/2/library/tarfile.html) – jedwards Mar 16 '15 at 23:05
  • Also, [this](http://stackoverflow.com/questions/20762094/how-are-zlib-gzip-and-zip-related-what-are-is-common-and-how-are-they-differen) seems like a good read. – jedwards Mar 16 '15 at 23:09
  • It's not a tar file, it's a cpio archive compressed using the `compress` utility. There are legacy reasons for it being compressed in that way and it's unlikely to change. Given that `gzip` happily inflates `compress`d files, I figured it was zlib under the hood. However, I suspect gzip has been statically linked, since I can't see any relevant dynamic linkage through ldd. I guess my alternative is to use another Popen subprocess to send the data through `gzip -cd` – Craig Mar 16 '15 at 23:21

1 Answer

\x1f\x9d are the first two bytes of the old Unix compress format. zlib can't help you decompress it. gzip can decompress it just to be compatible with the old compress utility.
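
To make the distinction concrete, here is a minimal sketch that classifies a stream by its first two bytes. The function name `sniff_compression` and its return labels are made up for illustration. The zlib test follows the RFC 1950 header rule (compression method 8, and the two header bytes read as a big-endian number divisible by 31), which is the check that `\x1f\x9d` fails with "incorrect header check".

```python
def sniff_compression(head):
    """Guess the compression format from the first two bytes of a stream.

    `head` is a byte string holding at least the start of the data.
    The labels returned here are illustrative, not from any library.
    """
    magic = bytearray(head[:2])  # bytearray yields ints on Python 2 and 3
    if len(magic) < 2:
        return "unknown"
    b0, b1 = magic
    if (b0, b1) == (0x1f, 0x8b):
        return "gzip"        # gzip member header (RFC 1952)
    if (b0, b1) == (0x1f, 0x9d):
        return "compress"    # old Unix compress (LZW), not handled by zlib
    # zlib stream (RFC 1950): low nibble of the first byte is the method (8),
    # and the two header bytes as a 16-bit number are a multiple of 31.
    if (b0 & 0x0f) == 8 and ((b0 << 8) | b1) % 31 == 0:
        return "zlib"
    return "unknown"


print(sniff_compression(b"\x1f\x9d\x90"))
```

Running a check like this on the first chunk read from the download would show which decompressor the data needs before any of it is fed to `zlib.decompressobj()`.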

You can pull the code from pigz for decompressing that format and use it directly.
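
If porting the pigz code is more than you want to take on, the subprocess route mentioned in the comments (`gzip -cd`) could look roughly like the sketch below. This is only an outline under some assumptions: `stream_through` and its parameters are hypothetical names, a `gzip` binary is assumed to be on the PATH, and a helper thread feeds the child's stdin so that reading its stdout cannot deadlock on full pipes.

```python
import subprocess
import threading


def stream_through(chunks, command=("gzip", "-cd"), bufsize=10240):
    """Yield the output of `command` while feeding it `chunks` on stdin.

    `chunks` is any iterable of byte strings, for example successive
    reader.read(10240) results. All names here are illustrative.
    """
    proc = subprocess.Popen(list(command),
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

    def feed():
        # Runs in a separate thread so the main thread can drain stdout.
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        except (IOError, OSError):
            pass  # child exited early; the exit status is checked below
        finally:
            try:
                proc.stdin.close()
            except (IOError, OSError):
                pass

    feeder = threading.Thread(target=feed)
    feeder.daemon = True
    feeder.start()

    while True:
        data = proc.stdout.read(bufsize)
        if not data:
            break
        yield data

    feeder.join()
    proc.stdout.close()
    if proc.wait() != 0:
        raise IOError("%s exited with status %d" % (command[0], proc.returncode))
```

Each chunk yielded by the generator could then be passed to the cpio stage in place of the result of `zlib`'s `decompress()`, keeping memory use bounded by the chunk size.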

Mark Adler