
I'm trying to work with messages that come from SQS.

My colleague is sending a JSON object (from Java) with one field that is gzip-compressed, so the field is essentially a gzip-compressed byte stream.

When I try to see it directly on SQS that field looks like:

"Message" : "\u001F�\b\u0000\u0000\u0000\u0000\u0000\u0000\u0000mRmo�P\u0014�/M�h2�N�~1&�43?\u0019�X��R�\u001A(\u0004J�%&���\r��\u0000\u001Bn\u000B\u0010�\u0019\u0006\u0012�8d2�zo/��[.�F퇛��<�9�9缠Ԕeh1}�2��N�\u0014<.9�\u001C�;�pO�G���\u0002�yP��~�\u0013�t�_��姹:�B,-�=\u0004\r\u001AcH\u0010!�@-Rz2��c�8��Ĉ �A>��o�����\t�Kx;m��=�H�\u0006~���t\"�Ҽp6����,���\u0012q\u001F%�����e%�2���c�,-3w�lzv�7�����t��-Uɰ�\u0010�9Q�\u0014\u00108]\n���\u0005TU�\u0006\u001E�R$\u0012���8��e�Ե4%?��\u0007\u0007Р\t\n5l�-���?D#���\u001EՇi)]�\u0012����W��2V\u0000[�i���l�i\u0017������RZ´t��.�K��o��\u0013��|\u001F\u0013��]ż!r��MRd������F\u001C+��_��:\u0017\u00132���b\u0013�����L�U19�\u0019\u0017@���~��:��(cA�\u0015\u0019^RL�&�{�r�d��\u0018�n�N\u001F\r��Y�\u0019���M����\u0010~�z;��\u001E�@o��vq���B\u0002��Q�\u0004>�G�mwo�*���\u0002M�MZ�e��M�̪\u0010\u0014S���$�7V1��ߡL�W1�y��W&{��!\u001A\u001C6��\u0003�\u001DX����\u00105�\u0000{\u0002���J�\f��sQ���\u0003xP��6�d�U�z�\u000BJ�\u0017�i\u0003\u0000\u0000",`

My code:

import json
import zlib

for message in queue.receive_messages(AttributeNames=['All']):
    message_dict = json.loads(message.body)
    compressed = message_dict['Message']
    ungziped_str = zlib.decompressobj().decompress(compressed.encode('utf-8'))

Gives:

zlib.error: Error -3 while decompressing data: incorrect header check

Any way to read the contents of it?

By the way, I've tried https://stackoverflow.com/a/12572031/536474 and still same error message.

    Gzipping data and then treating the result as though it were UTF-8 is not something you're going to get away with easily. The gzipped data should be encoded in base64. Otherwise, your colleague needs to show you a proof of concept that this can actually *be* decoded -- that is, proof that there isn't a lossy transformation somewhere. Not every possible byte is valid in every possible position in UTF-8, but every symbol in the base64 alphabet is a valid, single-byte UTF-8 character... a far better solution. – Michael - sqlbot May 03 '17 at 01:29
    In fact, the `�` is a huge red flag, depending on where it's getting injected -- your side, the colleague's side, or in SQS. The first 2 octets of a gzip stream are always `0x1f` `0x8b`. We see `\u001F` but then we have `�` and the reason for that is that `0x8b` is indeed an invalid octet when the previous octet <= `0x7f` in UTF-8. It can only ever legally be preceded by another octet that is also >= `0x80`. The rules are a little complicated but that's not especially relevant -- the point is that you can't treat blobs as characters with impunity. – Michael - sqlbot May 03 '17 at 01:40

1 Answer


Michael - sqlbot was right. According to the AWS SQS documentation, message attributes support three data types:

  1. String – Strings are Unicode with UTF-8 binary encoding. For a list of code values, see http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters.
  2. Number – Numbers are positive or negative integers or floating-point numbers. Numbers have sufficient range and precision to encompass most of the possible values that integers, floats, and doubles typically support. A number can have up to 38 digits of precision, and it can be between 10^-128 and 10^+126. Leading and trailing zeroes are trimmed.
  3. Binary – Binary type attributes can store any binary data, for example, compressed data, encrypted data, or images.

SQS expects the user to supply a Base64-encoded value when sending a Binary type.
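So once the sender base64-encodes the gzipped payload before putting it in the message, the receiving side can decode it. A sketch, assuming the `Message` field name from the question (the round-trip sender here stands in for the Java side):

```python
import base64
import gzip
import json


def decode_message(body: str) -> str:
    """Decode an SQS body whose 'Message' field holds base64-encoded gzip data."""
    message_dict = json.loads(body)
    compressed = base64.b64decode(message_dict["Message"])
    return gzip.decompress(compressed).decode("utf-8")


# Round-trip demo: gzip + base64 on the sending side, then decode
payload = json.dumps({
    "Message": base64.b64encode(
        gzip.compress('{"user": "test"}'.encode("utf-8"))
    ).decode("ascii")
})
print(decode_message(payload))  # {"user": "test"}
```

Because every base64 symbol is a plain ASCII character, the encoded field survives the UTF-8 string handling that mangled the raw gzip bytes.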

KevinOelen