1

So I am generating a URL in python for a GET request (it has to be a GET request) and one of my query string parameters is EXTREMELY long (~900 chars) Is there any way I can compress this string and place it in the url? I have tried zlib but that uses bytes and the url needs to be a string. Basically is there any way to do this?

# On server
x = '900_char_string'
compressed_string = compress(x)
return 'http://whatever?querystring_var=' + compressed_string
# ^ return value is what client requests by clicking link with that url or whatever
# On client
# GET http://whatever?querystring_var=randomcompressedchars<900
# Server receiving request
value = request['querystring_var']
y = decompress(value)
print(y)
>>> 900_char_string  # at this point server can work with the uncompressed string


Rosey
  • 739
  • 1
  • 12
  • 27
  • Cant you just decode the bytes object from zlib.compress(querystring).decode("utf-8") – James Bingo Sep 10 '19 at 21:13
  • 2
    I'm unclear on the problem. You have a sequence of 900*8 bits that you want to compress, and later decompress without information loss. What redundancy exists in the 7200 bits that you believe you can squeeze out? All we know is that the string is generated URL, which has a rather [broad character set](https://stackoverflow.com/questions/1856785/characters-allowed-in-a-url#1856809). Unless you know of some redundancy, you're going to get less than a 25% compression (i.e. down to maybe 700 bytes). – Prune Sep 10 '19 at 21:14
  • Just use gzip, it'll compress all request including headers and will be significantly more effective. – Olvin Roght Sep 10 '19 at 21:21
  • I can't decode the zlib bytes with utf8 because zlib is effectively encoding the string in it's much more space efficient encoding, not utf8. gzip also outputs a bytes string. I can't append bytes to the url string. @Prune I am not expecting a major reduction in size, I will take any I can get. Using base64 actually INCREASED the string size so maybe the lack of redundancy in the initial string means compression is not an option? The original string is an API key so very random looking: ENlrXcoKRqXYC16vE/aXXC553GcOA5fmFlKRNogNYJdAZNtjKfEV0ZzJem1o7K4D2RiN96HA2fWpqUypCpZQlKtyKKm... – Rosey Sep 12 '19 at 15:24

1 Answers1

0

The issue is now fairly clear. I think we need to examine this from a standpoint of information theory.

  • The input is a string of visible characters, currently represented in 8 bits each.
  • The "alphabet" for this string is alphanumeric (26+26+10 symbols), plus about 20 special and reserved characters, 80+ characters total.
  • There is no apparent redundancy in the generated string.

There are three main avenues to shortening a representation, taking advantage of

  • Frequency of characters (hamming): replace a frequent character with fewer than 8 bits; longer bit strings will then be needed for rare characters.
  • Frequency of substrings (compression): replace a frequent substring with a single character.
  • Convert to a different base: ideally, len(alphabet).

The first two methods can lengthen the resulting string, as they require starting with a translation table. Also, since your strings appear to be taken from a uniform random distribution, there will be no redundancy or commonality to leverage. When the Shannon entropy is at or near the maximum over the input tokens, there is nothing to be gained in those methods.

This leaves us base conversion. We're using 8 bits -- 256 combinations -- to represent an alphabet of only 82 characters. A simple base conversion will save about 20%; the ratio is log(82) / log(256). If you want a cheap conversion, simply map into a 7-bit representation, a saving of 12.5%

Very simply, define a symbol ordinality on your character set, such as

0123456789ABCDEFGH...YZabcd...yz:/?#[]()@!$%&'*+,;=%   (81 chars)

Now, compute the numerical equivalent of a given string, just as if you were hand-coding a conversion from a decimal or hex string. The resulting large integer is the compressed value. Write it out in bytes, or chop it into 32-bit integers, or whatever fits your intermediate storage medium.

Prune
  • 76,765
  • 14
  • 60
  • 81