
How do you manage chunked data with gzip encoding? I have a server which sends data in the following manner:

HTTP/1.1 200 OK\r\n
...
Transfer-Encoding: chunked\r\n
Content-Encoding: gzip\r\n
\r\n
1f50\r\n\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xebr\xdb\xb8\xd2\xe0\xef\xb8\xea\xbc\x03\xa2\xcc\x17\xd9\xc7\xba\xfa\x1e\xc9r*\x93\xcbL\xf6\xcc\x9c\xcc7\xf1\x9c\xf9\xb6r\xb2.H ... L\x9aFs\xe7d\xe3\xff\x01\x00\x00\xff\xff\x03\x00H\x9c\xf6\xe93\x00\x01\x00\r\n0\r\n\r\n

I've tried a few different approaches to this, but there's something I'm missing here.

import select
import zlib

# sock is a connected socket; poller is an epoll object registered on it (setup omitted)
data = b''
depleted = False
while not depleted:
    depleted = True
    for fd, event in poller.poll(2.0):
        depleted = False
        if event == select.EPOLLIN:
            tmp = sock.recv(8192)
            data += zlib.decompress(tmp, 15 + 32)

This gives (I also tried decompressing only the data after \r\n\r\n, of course):
zlib.error: Error -3 while decompressing data: incorrect header check

So I figured the data should be decompressed once it has been received in its entirety:

        ...
        if event == select.EPOLLIN:
            data += sock.recv(8192)
data = zlib.decompress(data.split(b'\r\n\r\n',1)[1], 15 + 32)

Same error. I also tried decompressing data[:-7] because of the chunk ID at the very end of the data, as well as data[2:-7] and various other combinations, all with the same error.

I've also tried the gzip module via:

with gzip.GzipFile(fileobj=BytesIO(data), mode='rb') as fh:
    fh.read()

But that gives me "Not a gzipped file".

Even after recording the data as received from the server (headers + data) into a file, and then creating a server socket on port 80 serving that data (again, as-is) to the browser, it renders perfectly, so the data is intact. I took this data, stripped off the headers (and nothing else) and ran gzip on the file, with no luck.

Thanks to @mark-adler I produced the following code to un-chunk the chunked data:

unchunked = b''
pos = 0
while pos <= len(data):
    chunkLen = int(binascii.hexlify(data[pos:pos+2]), 16)
    unchunked += data[pos+2:pos+2+chunkLen]
    pos += 2+len('\r\n')+chunkLen

with gzip.GzipFile(fileobj=BytesIO(data[:-7])) as fh:
    data = fh.read()

This produces OSError: CRC check failed 0x70a18ee9 != 0x5666e236, which is one step closer. In short, I clip the data according to these four parts (see the toy example after the list):

  • <chunk length o' X bytes> \r\n <chunk> \r\n
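(For a concrete toy illustration of that framing, with values invented here rather than taken from my server, a single 5-byte chunk followed by the terminating chunk would look like this:)

b'5\r\nHELLO\r\n'   # hex length, CRLF, 5 bytes of chunk data, CRLF
b'0\r\n\r\n'        # zero-length chunk that terminates the body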

I'm probably getting there, but not close enough.

Footnote: Yes, the socket handling is far from optimal, but it looks this way because I thought I wasn't getting all the data from the socket, so I implemented a huge timeout and an attempt at a fail-safe with depleted :)

Torxed

2 Answers


You can't split on \r\n, since the compressed data may contain, and if long enough certainly will contain, that sequence. You need to dechunk first using the length provided (e.g. the first length, 1f50) and feed the resulting chunks to decompress. The compressed data starts with the \x1f\x8b.

The chunking is hex number, crlf, chunk with that many bytes, crlf, hex number, crlf, chunk, crlf, ..., last chunk (of zero length), [possibly some headers], crlf.
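A minimal sketch of that approach (the helper name and structure here are illustrative, not code from this answer): pull each chunk out of the body using the declared lengths and feed it straight to a zlib decompressor.

import zlib

def dechunk_and_gunzip(body):
    """Walk a chunked HTTP body and stream each chunk into a zlib decompressor."""
    d = zlib.decompressobj(15 + 16)             # 15 + 16 tells zlib to expect a gzip header
    out = b''
    pos = 0
    while True:
        eol = body.index(b'\r\n', pos)          # end of the hex size line
        size = int(body[pos:eol], 16)
        if size == 0:                           # zero-length chunk ends the body
            break
        chunk = body[eol + 2:eol + 2 + size]
        out += d.decompress(chunk)              # chunks can be fed one at a time
        pos = eol + 2 + size + 2                # skip the chunk and its trailing \r\n
    return out + d.flush()

Because the decompressor keeps its own state between calls, there is no need to assemble the whole gzip stream before decompressing.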

Mark Adler
  • `1f50` would be, 50 bytes? or? Because I'm guessing there will be another length-identifier in there somewhere? – Torxed Mar 05 '14 at 14:59
  • It means `1f50` bytes. In hex. The last chunk is terminated with a `\r\n` with no length. – Mark Adler Mar 05 '14 at 15:01
  • This is starting to get helpful :) haha, awesome! – Torxed Mar 05 '14 at 15:02
  • @Torxed: I updated the chunking description to be more complete. – Mark Adler Mar 05 '14 at 16:05
  • It was still a bit confusing, I don't know if it was the sentence `chunk with that many bytes` that put me off a bit, or the fact that you continued after `any bytes, crlf`, should probably have stopped there cause it got me confused with semi-finished-pseudo-syntax. Altho that's my own fault :) You'll get the win even tho i wrote a complete answer answering both the questions about the syntax, how to deal with it and what library/method to use in order to unzip the data. – Torxed Mar 05 '14 at 18:40

@mark-adler gave me some good pointers on how chunked mode in the HTTP protocol works; besides this, I fiddled around with different ways of unzipping the data.

  1. You're supposed to stitch the chunks into one big heap
  2. You're supposed to use gzip not zlib
  3. You can only unzip the entire stitched chunks, doing it in parts will not work

Here's the solution for all three of the above problems:

import gzip
from io import BytesIO

# data holds the raw chunked HTTP body (everything after the blank line that ends the headers)
unchunked = b''
pos = 0
while pos <= len(data):
    chunkNumLen = data.find(b'\r\n', pos)-pos
#   print('Chunk length found between:',(pos, pos+chunkNumLen))
    chunkLen=int(data[pos:pos+chunkNumLen], 16)
#   print('This is the chunk length:', chunkLen)
    if chunkLen == 0:
#       print('The length was 0, we have reached the end of all chunks')
        break
    chunk = data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]
#   print('This is the chunk (Skipping',pos+chunkNumLen+len('\r\n'),', grabing',len(chunk),'bytes):', [data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]],'...',[data[pos+chunkNumLen+len('\r\n')+chunkLen:pos+chunkNumLen+len('\r\n')+chunkLen+4]])
    unchunked += chunk
    pos += chunkNumLen+len('\r\n')+chunkLen+len('\r\n')

with gzip.GzipFile(fileobj=BytesIO(unchunked)) as fh:
    unzipped = fh.read()

return unzipped

I left the debug output in there (commented out) for a reason.
Even though it looks like a mess, it was extremely useful for seeing what data I was actually trying to decompress, which parts were fetched from where, and which values each calculation brings forth.

This code will walk through the chunked data with the following format:
<chunk length o' X bytes> \r\n <chunk> \r\n

I had to be careful when extracting the length, since it comes in as something like 1f50. At first I had to run binascii.hexlify(data[0:4]) on it before passing it to int(); I'm not sure why I no longer need that, because earlier it gave me a length of ~8000 and then suddenly a really big number that made no sense, even though I hadn't really changed the input.. anyway. After that it was just a matter of making sure the numbers were correct, combining all the chunks into one huge pile of gzip data, and feeding that into .GzipFile(...).
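For the record, as @J.F.Sebastian explains in the comments, the confusion comes down to the size line being ASCII hex digits rather than raw bytes; a quick interpreter session (values borrowed from those comments) makes the difference concrete:

>>> import binascii
>>> int(b'1f50', 16)            # the size line is ASCII hex digits
8016
>>> binascii.hexlify(b'1f50')   # hexlify re-encodes the ASCII characters themselves
b'31663530'
>>> len(b'1f50'), len(b'\x1f\x50')
(4, 2)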

Edit 3 years later:

I'm aware that this was a client-side problem at first, but here's a server-side function to send a somewhat functional test response:

def http_gzip(data):
    compressed = gzip.compress(data)

    # format(49, 'x') returns `31`, i.e. `0x31` but without the `0x` notation.
    # basically the same as `hex(49)`, but meant for this kind of formatting.
    return bytes(format(len(compressed), 'x'), 'UTF-8') + b'\r\n' + compressed + b'\r\n0\r\n\r\n'
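A rough round-trip check (a quick sketch, assuming gzip is imported and that the helper emits exactly one data chunk before the terminator):

chunked = http_gzip(b'Hello, world')

size_end = chunked.index(b'\r\n')
size = int(chunked[:size_end], 16)
payload = chunked[size_end + 2:size_end + 2 + size]
assert gzip.decompress(payload) == b'Hello, world'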
Torxed
  • 1. you don't need to stitch the chunks into one big heap. It might be more convenient but it is not necessary 2. [`gzip.GzipFile` is implemented on top of `zlib`](http://hg.python.org/cpython/file/2.7/Lib/gzip.py#l278) 3. you can [decompress partial content](http://stackoverflow.com/a/20624942/4279) 4. [there could be chunk extensions after the size](https://tools.ietf.org/html/rfc2616#section-3.6.1) that must be ignored if you don't understand them i.e., ignore optional `; ...` until `\r\n`. – jfs Mar 05 '14 at 19:31
  • don't use `hexlify()` here. Chunk length is a hex number e.g., `int(b'ff', 16) == 255` and in this case it is two bytes `b'f'` and `b'f'`; `ord(b'f') == 102 == 0x66` i.e., `b'f' == b'\x66'` that is why `binascii.hexlify(b'ff') == b'6666'` (note: it is ascii `b'6'` here i.e., `b'\x36'`). – jfs Mar 05 '14 at 19:36
  • You shouldn't use find, since if there is an error in the transmitted data, the find may find a `\r\n` in the compressed data. You should be looking specifically for the correct data in the correct place. You should look for hex digits. There needs to be at least one. If not, error. Look for more hex digits. If the first non-hex-digit is not `\r`, then error. If the next character isn't `\n`, error. Then you count out the bytes for the chunk. If the next byte after the chunk is not `\r`, then error. Etc. – Mark Adler Mar 05 '14 at 21:47
  • As noted already, you do _not_ need to assemble the entire stream first. You can feed the decompressor a chunk at a time, which will use a lot less memory. Make a `zlib.decompressobj` object and feed it the data. Providing `wbits` equal to `15+16` will decompress the gzip format. – Mark Adler Mar 05 '14 at 21:53
  • @MarkAdler `.find()` in this instance will only return the first occurance of `\r\n` which is equal to `\r\n`, so it's as safe as it gets. Also i need a way to get the full hex-code since it sometimes is only one byte and not two. I'll try to decompress each chunk individually.. But when trying with `zlib` at first it didn't work out so well (tried both in one big chunk, and each individual piece of chunk).. – Torxed Mar 05 '14 at 22:11
  • @J.F.Sebastian I can't explain how this *phenomenon* popped up, but all i know is the data came in `b'\xff\xff'` which i couldn't input into `int()` since it expects a format much like `0xff` or even `ff`. Altho a bit confusing I do understand what you're getting at and i can't really defend my code or the output.. all i know is that when using `hexlify()` i did get the correct value (8091 if i'm not mistaken) from `1f50` (which hexlify converted into `0x1f0x50` which was exactly what i needed) :) – Torxed Mar 05 '14 at 22:14
  • It's not safe if the `\r\n` is missing due to a transmission error. – Mark Adler Mar 05 '14 at 22:20
  • @Torxed: `if b'\xff' not in string.hexdigits.encode('ascii'): raise ValueError("it is not part of the chunk length")` – jfs Mar 05 '14 at 22:21
  • @MarkAdler if `\r\n` is missing due to a transmission error over a TCP connection, i'm quite positive that i won't have any data after the hex-value, if i even get the hex-value. I'll have way more problems than `.find()` going bananza on the data if i have transmission problems :) But just to be on the safe side, i'll emulate a transmission issue pre/mid/post the appropriate part and see what happens :) – Torxed Mar 05 '14 at 22:22
  • @Torxed: `hexlify(b'1f50') == b'31663530'`, `int(b'1f50', 16) == 8016` [Reread my previous comment](http://stackoverflow.com/questions/22199661/content-encoding-gzip-transfer-encoding-chunked-with-gzip-zlib-gives-incorre/22205682#comment33714942_22205682), try the expressions in Python shell, play with it until you understand that `b'1f50' != b'\x1f\x50'` and `len(b'1f50') == 4 and len(b'\x1f\x50') == 2` – jfs Mar 06 '14 at 01:26
  • @J.F.Sebastian I am thanks to your mind boggling examples, everything is starting to fall into place! You're a champ! :) – Torxed Mar 06 '14 at 07:40
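Pulling together the validation advice from these comments (at least one hex digit, optional ;extensions per RFC 2616 §3.6.1, and a CRLF exactly where expected), a stricter chunk-size parser could look roughly like the sketch below; the function name and error messages are invented for illustration:

import string

HEXDIGITS = string.hexdigits.encode('ascii')  # b'0123456789abcdefABCDEF'

def read_chunk_size(data, pos):
    """Parse one chunk-size line starting at pos.

    Returns (chunk_length, offset of the first byte of chunk data).
    Raises ValueError if the line is not: hex-digits [;extensions] CRLF.
    """
    end = pos
    while end < len(data) and data[end:end + 1] in HEXDIGITS:
        end += 1
    if end == pos:
        raise ValueError('expected at least one hex digit at offset %d' % pos)
    size = int(data[pos:end], 16)
    if data[end:end + 1] == b';':               # chunk extensions: skip to CRLF
        end = data.index(b'\r\n', end)
    if data[end:end + 2] != b'\r\n':
        raise ValueError('chunk-size line not terminated by CRLF')
    return size, end + 2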