
I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.

The code is as follows:

import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))

# 524288 65505
# 524288 0
# 524288 0
# ...

This code prints 65505 for the first chunk and then 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk and then print the length of the uncompressed version.

Is there something I'm missing?

WillJones
  • Can you confirm that the compressed chunk has a non-zero length? i.e. `print(len(chunk), len(data))`. And clarify whether it's gzip or zlib? – afaulconbridge Apr 06 '20 at 09:37

1 Answer


It seems like your input file is block gzip (bgzip, http://www.htslib.org/doc/bgzip.html), because you got a single ~65 KB block of decoded data.

GZip files can be concatenated together (see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage), and block gzip uses this to concatenate blocks of the same file, so that, with an associated index, only the specific block containing the information of interest has to be decoded.
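You can see this member-concatenation behaviour directly with zlib: a `decompressobj` stops at the end of the first gzip member and leaves the rest of the bytes in `unused_data`. A minimal sketch (using `gzip.compress` to fake two concatenated members, since bgzip itself isn't needed to demonstrate it):

```python
import gzip
import zlib

# Two independent gzip members concatenated back to back -- legal per the
# gzip spec, and exactly how bgzip lays out its blocks.
blob = gzip.compress(b"first") + gzip.compress(b"second")

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
data = dec.decompress(blob)
print(data)                    # b'first' -- decoding stops at the first member
print(bool(dec.unused_data))   # True -- the second member is left untouched
```

This is why the question's loop prints 0 after the first chunk: the decompressor has hit the end of the first member, so every later chunk just accumulates in `unused_data`.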

So to stream decode a block gzip file, you need to use the leftover data from one block to start a new one. E.g.

# source is a block gzip file, see http://www.htslib.org/doc/bgzip.html
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip members; if this chunk contains
    # data beyond the end of the current block, it needs to be processed
    # with a fresh decompressor
    while dec.unused_data:
        # end of one block
        leftovers = dec.unused_data
        # create a new decompressor
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data += dec.decompress(leftovers)
    # TODO: handle data
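Wrapped up as a generator, the same loop can be tested end to end. This is a sketch under the assumption that the input is a series of concatenated gzip members (here faked with `gzip.compress`, since that reproduces the member boundaries without needing an actual bgzip file); `stream_decompress` is a name invented for illustration:

```python
import gzip
import zlib

def stream_decompress(chunks):
    """Yield decompressed data from an iterable of compressed chunks,
    starting a fresh decompressor at every gzip member boundary."""
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = dec.decompress(chunk)
        while dec.unused_data:
            leftovers = dec.unused_data
            dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            data += dec.decompress(leftovers)
        if data:
            yield data

# Simulate a bgzip-style stream: independent gzip members back to back,
# delivered in small fixed-size chunks like iter_chunks() would.
blob = b"".join(gzip.compress(bytes([i]) * 70000) for i in range(3))
chunks = (blob[i:i + 64] for i in range(0, len(blob), 64))
out = b"".join(stream_decompress(chunks))
print(out == b"\x00" * 70000 + b"\x01" * 70000 + b"\x02" * 70000)  # True
```

Note that the `while` also covers the case where a member ends exactly on a chunk boundary: once a decompressor has hit end-of-stream, any further input it is fed lands in `unused_data`, so the next iteration of the loop picks it up with a fresh decompressor.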
afaulconbridge
  • If indeed each block is a separate GZip file, you want to use `data += dec.flush()`. `dec.flush()` returns the decompressed data from the `.unused_data` bytes object, without the need for a separate decompression object. Do move the `dec = zlib ...` line into the `for` loop in that case. – Martijn Pieters Jun 17 '23 at 19:09