
The compression is done in JavaScript using pako (https://github.com/nodeca/pako). It compresses the string 't':

var compressedString = pako.gzip('t', {level: 4, to: 'string'});
$.ajax('/decompress', {method: 'POST', data: {string: compressedString}});

The code at /decompress that does the decompression:

from cgi import parse_qs
import zlib

def application(environ, start_response):
    status = '200 OK'
    try:
        request_body_size = int(environ.get('CONTENT_LENGTH', 0))
    except ValueError:
        request_body_size = 0
    request_body = environ['wsgi.input'].read(request_body_size)
    d = parse_qs(request_body)

    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    inputString = d.get('string')[0]
    # Use the same wbits (=31, i.e. gzip framing) as used by pako
    decompressed = zlib.decompress(inputString, 31)
    # WSGI applications must return an iterable of byte strings
    return ['done']

The decompression throws the following error, at the zlib.decompress line:

error: Error -3 while decompressing data: incorrect header check

I also tried encoding the input string first:

inputString.encode('utf-8')

but that throws the same error.

Hariom Balhara

1 Answer

to: 'string'

This option smuggles the output byte sequence into a JS (Unicode) String, by mapping each byte to the character with the same code point. (This is equivalent to decoding using the ISO-8859-1 encoding.)
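
As a minimal sketch in Python (the sample bytes here are just the first bytes of a gzip header, chosen for illustration):

# pako's to:'string' maps each output byte to the character with the
# same code point; in Python that mapping is exactly the ISO-8859-1 codec.
gzip_bytes = b'\x1f\x8b\x08\x00'
js_style_string = gzip_bytes.decode('iso-8859-1')
assert js_style_string == u'\x1f\x8b\x08\x00'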

$.ajax('/decompress', {method: 'POST', data: {string: compressedString}});

XMLHttpRequest needs to encode the (Unicode) string value back to a byte sequence to go (URL-encoded) over the network. The encoding it uses is UTF-8, not ISO-8859-1, so the sequence of bytes on the network won't be the same sequence of bytes that came out of the GZip compressor.
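
You can see the mismatch in a couple of lines of Python (a sketch; 0x1f 0x8b are the gzip magic bytes):

raw = b'\x1f\x8b'                          # gzip magic bytes
as_chars = raw.decode('iso-8859-1')        # what pako's to:'string' produces
over_the_wire = as_chars.encode('utf-8')   # what XMLHttpRequest sends
assert over_the_wire == b'\x1f\xc2\x8b'    # 0x8b arrives as the two bytes 0xc2 0x8b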

You can undo this process at the Python end by re-encoding the extracted value after the URL-decode step:

inputString = d.get('string')[0].decode('utf-8').encode('iso-8859-1')

Now you should have the same sequence of bytes that came out of the compressor.
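
To convince yourself offline, here is a round-trip check as a sketch (it uses zlib.compressobj with wbits=31 to produce a gzip stream like pako.gzip does; the names are illustrative):

import zlib

co = zlib.compressobj(6, zlib.DEFLATED, 31)               # wbits=31 -> gzip framing
original = co.compress(b'hello world') + co.flush()
mangled = original.decode('iso-8859-1').encode('utf-8')   # the UTF-8 trip the browser makes
restored = mangled.decode('utf-8').encode('iso-8859-1')   # the fix above
assert restored == original
assert zlib.decompress(restored, 31) == b'hello world'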

Sending bytes as UTF-8-encoded codepoints, and URL-encoding the non-ASCII bytes out of that, will together bloat the network traffic to about four times as much as the raw bytes would take up, which rather undoes the good work of the compression.
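
A rough way to check that factor (a sketch, using random bytes as a stand-in for compressed output, since compressed data is close to uniformly random):

import os
try:
    from urllib import quote           # Python 2
except ImportError:
    from urllib.parse import quote     # Python 3

raw = os.urandom(1000)                 # stand-in for compressed output
wire = quote(raw.decode('iso-8859-1').encode('utf-8'))
print(len(wire) / float(len(raw)))     # typically close to 4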

If you just post the data string on its own as a request body to the Python script, you could lose the URL-encoding and then your request would be only(!) about 50% more than the raw compressed data. To do any better than that you would need to start looking at sending the raw bytes directly as a ByteArray, or perhaps using multipart form-data. Either way there are browser compatibility problems.
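
The 50% figure follows from the UTF-8 step alone: ASCII bytes stay one byte and high bytes become two. Roughly, as a sketch reusing the random stand-in from above:

import os

raw = os.urandom(1000)
body = raw.decode('iso-8859-1').encode('utf-8')   # UTF-8 only, no URL-encoding
print(len(body) / float(len(raw)))                # typically close to 1.5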

bobince
  • I had no idea about the ISO-8859-1 encoding. I literally spent days trying to fix this issue. Thanks a lot :) – Hariom Balhara Dec 06 '16 at 06:00
  • I will look into the problem you mentioned of sending multiple bytes per byte due to UTF-8 encoding (which partly defeats the purpose of compression). The problem currently is that I need to send hybrid data: some values are non-binary and some are binary. So I can't set a multipart form-data header directly. – Hariom Balhara Dec 06 '16 at 06:09
  • First step might be to try base64, which is only 33% larger than raw. You get base64 as [`atob()`](https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/atob) in most browsers, but (again!) you would need a fallback for IE<10. (At least that one is easy to polyfill.) – bobince Dec 06 '16 at 21:47
  • Yeah, I tried base64, which works fine but again partly undoes the compression work. I am sending binary data now, with a fallback to base64 for old browsers. – Hariom Balhara Dec 07 '16 at 04:42