
I doubt if this is encryption but I can't find a better phrase. I need to pass a long query string like this:

http://test.com/test.php?key=[some_very_loooooooooooooooooooooooong_query_string]

The query string contains NO sensitive information so I'm not really concerned about security in this case. It's just...well, too long and ugly. Is there a library function that can let me encode/encrypt/compress the query string into something similar to the result of a md5() (similar as in, always a 32 character string), but decode/decrypt/decompress-able?

jodeci
  • You already named it: "compression" would be a more appropriate title maybe ;) Or why not send the data via POST? – Felix Kling Jun 08 '10 at 09:21
  • Or store it in the SESSION... there are so many ways, but he wants to store it in the URI. It's not a bad idea :P – therufa Jun 08 '10 at 09:25
  • Note that a GET string should never exceed 1-2 kilobytes in size due to server and browser limitations. – Pekka Jun 08 '10 at 09:25
  • is it for same server or for some other one? – Your Common Sense Jun 08 '10 at 09:30
  • POST would be nice if there was a `<form>` to work with. In this case it's just a plain URL, which hopefully I won't have to bloat up into a `<form>` just to get this to work! SESSION was actually my first choice, but unfortunately I need to deal with multiple instances, so that did not work out either. Afraid I'm stuck with the URI. – jodeci Jun 08 '10 at 09:46
  • @jodeci you shouldn't have a problem with multiple instances if you give each query string a unique, random identifier. – Pekka Jun 08 '10 at 09:51
  • @Pekka Ah...will definitely try that! – jodeci Jun 08 '10 at 09:57
  • At least you, @Pekka, came to the **real** answer :) – Your Common Sense Jun 08 '10 at 10:09
  • @Col you will notice I gave that answer already an hour ago below :P but it won't necessarily be the best option, it won't work if it's an external link (= no session present). You would then have to work with a file/database based approach instead of sessions. – Pekka Jun 08 '10 at 10:11

9 Answers


You could try a combination of gzdeflate (raw deflate format) to compress your data and base64_encode to use only characters that are allowed without percent-encoding (additionally replacing the characters + and / with - and _):

$output = rtrim(strtr(base64_encode(gzdeflate($input, 9)), '+/', '-_'), '=');

And the reverse:

$output = gzinflate(base64_decode(strtr($input, '-_', '+/')));

Here is an example:

$input = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';

// percent-encoding on plain text
var_dump(urlencode($input));

// deflated input
$output = rtrim(strtr(base64_encode(gzdeflate($input, 9)), '+/', '-_'), '=');
var_dump($output);

The savings in this case are about 23%, but the actual efficiency of this compression procedure depends on the data you are using.
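Wrapped up as a pair of helpers (the function names here are my own), the round trip looks like this:

```php
<?php
// URL-safe compression helpers (function names are my own invention).
function url_compress($input) {
    // deflate, Base64-encode, make the alphabet URL-safe, drop the padding
    return rtrim(strtr(base64_encode(gzdeflate($input, 9)), '+/', '-_'), '=');
}

function url_decompress($encoded) {
    // restore the standard Base64 alphabet; base64_decode tolerates missing padding
    return gzinflate(base64_decode(strtr($encoded, '-_', '+/')));
}

$query  = 'foo=bar&items=1,2,3,4,5&sort=name&order=asc';
$packed = url_compress($query);

// The packed string only contains RFC 3986 unreserved characters, so it
// survives rawurlencode() unchanged and can go straight into a URL.
var_dump($packed === rawurlencode($packed)); // bool(true)
var_dump(url_decompress($packed) === $query); // bool(true)
```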

Dave Jarvis
Gumbo

The basic premise is very difficult. Transporting any value in the URL means you're restricted to a subset of ASCII characters. Using any sort of compression like gzcompress would reduce the size of the string, but result in a binary blob. That binary blob can't be transported in the URL though, since it would produce invalid characters. To transport that binary blob using a subset of ASCII you need to encode it in some way and turn it into ASCII characters.

So, you'd turn ASCII characters into something else which you'd then turn into ASCII characters.

But actually, most of the time the ASCII string you start out with is already the shortest representation. Here's a quick test:

$str = 'Hello I am a very very very very long search string';
echo $str . "\n";
echo base64_encode(gzcompress($str, 9)) . "\n";
echo bin2hex(gzcompress($str, 9)) . "\n";
echo urlencode(gzcompress($str, 9)) . "\n";

Hello I am a very very very very long search string
eNrzSM3JyVfwVEjMVUhUKEstqkQncvLz0hWKUxOLkjMUikuKMvPSAc+AEoI=
78daf348cdc9c957f05448cc554854284b2daa442772f2f3d2158a53138b9233148a4b8a32f3d201cf801282
x%DA%F3H%CD%C9%C9W%F0TH%CCUHT%28K-%AAD%27r%F2%F3%D2%15%8AS%13%8B%923%14%8AK%8A2%F3%D2%01%CF%80%12%82

As you can see, the original string is the shortest. Among the encoded compressions, base64 is the shortest since it uses the largest alphabet to represent the binary data. It's still longer than the original though.

For some very specific combination of characters with some very specific compression algorithm that compresses to ASCII representable data it may be possible to achieve some compression, but that's rather theoretical. Update: Actually, that sounds too negative. The thing is you need to figure out if compression makes sense for your use case. Different data compresses differently and different encoding algorithms work differently. Also, longer strings may achieve a better compression ratio. There's probably a sweet spot somewhere where some compression can be achieved. You need to figure out if you're in that sweet spot most of the time or not.
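One pragmatic way to handle that sweet-spot question is to compute both forms and keep whichever is shorter. A sketch (the helper name and the flag keys are my own; the flag would travel as its own query parameter so the receiver knows which form it got):

```php
<?php
// Return the shorter of the two representations, plus a flag telling the
// decoder which one it got (helper name and flag keys are mine).
function shortest_for_url($s) {
    $plain  = urlencode($s);
    $packed = rtrim(strtr(base64_encode(gzdeflate($s, 9)), '+/', '-_'), '=');
    return strlen($packed) < strlen($plain)
        ? ['c' => $packed]   // compressed won
        : ['p' => $plain];   // plain won
}

// Short input: compression overhead loses. Long repetitive input: it wins.
$r = shortest_for_url(str_repeat('the same words over and over ', 40));
echo key($r), ' ', strlen(current($r)), "\n";
```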

Something like md5 is unsuitable since md5 is a hash, which means it's non-reversible. You can't get the original value back from it.

I'm afraid you can only send the parameter via POST, if it doesn't work in the URL.

deceze
  • Actually you cannot say for sure that any encoding will result in a longer string than the original one. It really depends on the original string. If the compression is good enough then even encoded versions can be shorter. I tried my example with base64 and it is shorter than the original string. But you are right in a way, because probably most of the time, the encoded versions will be longer. – Felix Kling Jun 08 '10 at 10:06
  • @Felix Yes, I just added that same thought to my answer. It's not impossible, but very impractical. – deceze Jun 08 '10 at 10:08
  • @gumbo True, but that really wouldn't do very much. – deceze Jun 08 '10 at 10:42
  • @deceze: It would extend the string by 20 characters, making it longer than the Base 64 encoded one. – Gumbo Jun 08 '10 at 10:58
  • @gumbo Not if the spaces are only replaced with a `+`. :) – deceze Jun 08 '10 at 11:00
  • Longer query strings will tend to have more redundancy, and will therefore tend to fare better under the base64/gzcompress solution. – ladenedge May 10 '12 at 21:45
  • Yep, you tried with a 50 char string, but what about 300 or 700 or 1500 chars? I would be very curious to see the results there, especially with mostly human readable text... – MBoros Feb 05 '14 at 17:49
  • @MBoros How about 15 million characters? That's called the argument ad absurdum. And deceze's answer holds while others cannot. –  Jun 15 '14 at 03:43
  • As far as I know, query strings go up to about 2000 characters. 15 million is a different problem :) But within those 2000 chars, in some cases you might want to encode stuff (e.g. data in a QR code that points to a URL) – MBoros Jun 16 '14 at 05:21

This works great for me:

$out = urlencode(base64_encode(gzcompress($in)));

Saves a lot.

$in = 'Hello I am a very very very very long search string'; // 51 chars in
// $out is 64 chars

Input length vs. output length for longer strings:

$in  500 chars -> $out 328 chars
$in 1000 chars -> $out 342 chars
$in 1500 chars -> $out 352 chars

So the longer the string, the better the compression. The compression level parameter doesn't seem to have much effect.
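A quick way to reproduce this kind of measurement (the test strings here are my own; note that highly repetitive filler compresses unrealistically well, so benchmark with realistic data):

```php
<?php
// Compare input vs. encoded-output lengths for growing inputs.
// Repetitive filler exaggerates the savings; use real data for an
// honest benchmark.
$base = 'Hello I am a very very very very long search string. ';
foreach ([1, 10, 20, 30] as $n) {
    $in  = str_repeat($base, $n);
    $out = urlencode(base64_encode(gzcompress($in)));
    printf("in: %4d  out: %4d\n", strlen($in), strlen($out));
}
```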

plannapus
Akyhne
  • I tested `urlencode(base64_encode(gzcompress($in)));` with an array `implode`d to a string. I started seeing compression at strings longer than about 80 characters. e.g. from 110 to 94. Your $in/$out variables are not very well explained in your answer. I think you're saying those are the input/output sizes of your strings (before/after compression). Here's my answer that uses yours as a starting point: http://stackoverflow.com/a/20915918/631764 – Buttle Butkus Jan 04 '14 at 04:46
  • If you are creating the longer input strings by repeating the same chunk of text you will get really good compression, since there's repeating patterns (low entropy) in the input. If you compress a string with less repetition (higher entropy) you will get less compression. Long story short: you need to ensure you benchmark with real(istic) data. – Henry Feb 13 '17 at 00:13

Update:
gzcompress() won't help you. For example, if you take the text of Pekka's answer as input:

String length: 640
Compressed string length: 375
URL encoded string length: 925
(with base64_encode, it is only 500 characters ;) )

So this way (passing the data via the URL) is probably not the best way...

If you don't exceed the URL length limits with the string, why do you care what the string looks like? I assume it gets created, sent and processed automatically anyway, doesn't it?

But if you want to use it as e.g. some kind of confirmation link in an email, you have to think about something short and easy to type for the user anyway. You could, e.g. store all the needed data in a database and create some kind of token.


Maybe gzcompress() can help you. But it produces characters that are not allowed in URLs, so you will have to use urlencode() too (which makes the string longer and ugly again ;) ).

Felix Kling
  • Yup. The result will need to be `urlencode()`d which makes it probably impractical for very small strings, but may work for larger ones. – Pekka Jun 08 '10 at 09:25
  • @Pekka: Yes just noticed it. It does not really seem to have an effect on short strings. – Felix Kling Jun 08 '10 at 09:29
  • Felix, I know you are at Facebook; about compressing strings there: Cassandra has UUIDs, which are long text! Twitter and Instagram also use Cassandra (and Cassandra was initially developed at Facebook), but when I looked up Facebook, Twitter and Instagram post or user IDs, they were not UUIDs, and I am not sure whether they use UUIDs (or timeuuid) in their URLs too, or have some kind of function or algorithm to reduce the length of the UUID. **Do you know what Facebook does to keep URLs short if it uses UUIDs?** – Mohammad Kermani Jun 06 '16 at 21:01

Basically, it's like the others say: compress the text and send it encoded in a useful way. But:

1) Common compression methods carry overhead (headers and dictionaries) that can outweigh short inputs. If the data is always an undetermined order of determined chunks (as a text is made of words or syllables[3], plus numbers and some symbols), you could always use the same static dictionary and simply not send it (don't paste it into the URL). That saves the space the dictionary would take.

1.a) If you are already sending the language (or it's always the same), you could generate a dictionary per language.

1.b) Take advantage of the format's restrictions. If you know it's a number, you can encode it directly (see 3). If you know it's a date, you could encode it as Unix time[1] (seconds since 01/01/1970), so "21/05/2013 23:45:18" becomes "519C070E" (hex); if it's a day of the year, you could encode it as days since New Year, counting 29/02 (25/08 would be 237).

1.c) You know emails have to follow certain rules, and usually come from the same few providers (Gmail, Yahoo, etc.). You could take advantage of that to compress them with your own simple method:

samplemail1@gmail.com,samplemail2@yahoo.com.ar,samplemail3@idontknowyou.com => samplemail1:1,samplemail2:5,samplemail3@idontknowyou:1

2) If the data follows patterns, you can use that to help compression. For example, if it always follows this pattern:

name=[TEXT 1]&phone=[PHONE]&mail=[MAIL]&desc=[TEXT 2]&create=[DATE 1]&modified=[DATE 2]&first=[NUMBER 1]&last=[NUMBER 2]

You could: 2.a) Ignore the fixed text and compress just the variable parts. Like:

[TEXT1]|[PHONE]|[MAIL]|[TEXT 2]|[DATE 1]|[DATE 2]|[NUMBER 1][NUMBER 2]

2.b) Encode or compress data by type (encode numbers using Base64[2] or similar), as in 1). This even allows you to suppress the separators. Like:

[DATE 1][DATE 2][NUMBER 1][NUMBER 2][PHONE][MAIL]|[TEXT 1]|[TEXT 2]

3) Coding:

3.a) While it is true that if we encode using characters not supported by HTTP they will be transformed into heavier ones (like 'año' => 'a%C3%B1o'), this can still be useful. Maybe you want to compress the data to store it in a Unicode or binary database, or to paste it on web sites (Facebook, Twitter, etc.).

3.b) Although Base64[2] is a good method, you can squeeze out more at the expense of speed (since you would use userland functions instead of compiled ones).

With JavaScript's encodeURI() function, at least, you can use any of these 80 characters in a parameter value without them being modified:

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.:,;+*-_/()$=!@?~'

So we could build our own "Base 80" encode/decode functions.
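As a sketch of what such a Base-80 codec could look like in PHP (these helpers are my own; like Base58, plain base conversion drops leading zero bytes and is O(n²), so this is only practical for short blobs):

```php
<?php
// Sketch of a "Base 80" codec over the 80-character alphabet listed above.
// Caveats: leading zero bytes are not preserved (the same issue Base58 has),
// and the digit-by-digit long division makes this O(n^2).
const BASE80_ALPHABET =
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' .
    ".:,;+*-_/()\$=!@?~'";

function base80_encode($bytes) {
    $digits = array_values(unpack('C*', $bytes)); // big-endian base-256 digits
    $out = '';
    while (count($digits) > 0) {
        // long division of the big number by 80; remainder is the next digit
        $quotient = [];
        $rem = 0;
        foreach ($digits as $d) {
            $acc = $rem * 256 + $d;
            $q   = intdiv($acc, 80);
            $rem = $acc % 80;
            if ($quotient || $q > 0) {  // skip leading zeros in the quotient
                $quotient[] = $q;
            }
        }
        $out = BASE80_ALPHABET[$rem] . $out;
        $digits = $quotient;
    }
    return $out;
}

function base80_decode($str) {
    $map = array_flip(str_split(BASE80_ALPHABET));
    $digits = []; // big-endian base-256 digits
    foreach (str_split($str) as $ch) {
        // multiply the accumulated number by 80 and add the next digit
        $carry = $map[$ch];
        for ($i = count($digits) - 1; $i >= 0; $i--) {
            $acc = $digits[$i] * 80 + $carry;
            $digits[$i] = $acc % 256;
            $carry = intdiv($acc, 256);
        }
        while ($carry > 0) {
            array_unshift($digits, $carry % 256);
            $carry = intdiv($carry, 256);
        }
    }
    return pack('C*', ...$digits);
}
```

One practical note: PHP's `$_GET` parsing decodes a literal '+' in a query string as a space, so for PHP receivers you may want to drop '+' from the alphabet.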

ESL

These functions will compress and decompress a string or an array.

Sometimes you might want to GET an array.

function _encode_string_array ($stringArray) {
    // serialize -> gzcompress -> addslashes (kept from the original; not
    // strictly needed, since Base64 output is already transport-safe)
    // -> Base64 with a URL-safe alphabet (+/= swapped for -_,)
    $s = strtr(base64_encode(addslashes(gzcompress(serialize($stringArray), 9))), '+/=', '-_,');
    return $s;
}

function _decode_string_array ($stringArray) {
    // reverse the steps above; never unserialize() untrusted input
    $s = unserialize(gzuncompress(stripslashes(base64_decode(strtr($stringArray, '-_,', '+/=')))));
    return $s;
}
stubben
  • Remember to *NEVER* unserialize data blindly. It's a very common security hole. – MM. Oct 13 '16 at 10:06

For long/very long string values, you should use the POST method instead of GET!

For encoding, you might want to try urlencode()/urldecode()

or htmlentities()/html_entity_decode()

Also be careful: '%2F' is interpreted as the '/' character (directory separator). If you only use urlencode(), you may want to do a replace on it.

I don't recommend gzcompress on GET parameters.
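For illustration, here is how the two standard encoders handle a value containing a slash and a space:

```php
<?php
// urlencode() percent-encodes '/' (and turns spaces into '+'),
// while rawurlencode() follows RFC 3986 and uses '%20' for spaces.
echo urlencode('a/b c'), "\n";    // a%2Fb+c
echo rawurlencode('a/b c'), "\n"; // a%2Fb%20c
```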


Not really an answer, but a comparison of various methods suggested here.

I used the answers by @Gumbo and @deceze to get a length comparison for a fairly long string I am using in a GET request.

<?php
    $test_str="33036,33037,33038,38780,38772,37671,36531,38360,39173,38676,37888,36828,39176,39196,37321,36840,38519,37946,36543,39287,38989,38976,36804,38880,38922,38292,38507,38893,38993,39035,37880,38897,38378,36880,38492,38910,36868,38196,38750,37938,39268,38209,36856,36767,37936,36805,39248,36777,39027,39056,38987,38779,38919,38771,36851,38675,37887,38246,38791,38783,38661,37899,36846,36834,39263,37928,36822,37947,38992,38516,39177,38904,38896,37320,39217,37879,38293,38511,38774,37670,38185,37927,37939,38286,38298,38977,37891,38881,38197,38457,36962,39171,36760,36748,39249,39231,39191,36951,36963,36755,38769,38891,38654,38792,36863,36875,36956,36968,38978,38299,36743,36753,37896,38926,39270,38372,37948,39250,38763,38190,38678,36761,37925,36776,36844,37323,38781,38744,38321,38202,38793,38510,38288,36816,38384,37906,38184,38192,38745,39218,38673,39178,39198,39036,38504,36754,39180,37919,38768,38195,36850,38203,38672,38882,38071,39189,36795,36783,38870,38764,39028,36762,36750,38980,36958,37924,38884,37920,38877,36858,38493,36742,37895,36835,37907,36823,38762,38361,37937,38373,37949,36950,39202,38495,38291,36533,39037,36716,38925,37620,38906,37878,37322,38754,36818,39029,39264,38297,38517,36969,38905,36957,36789,36741,37908,38302,38775,39216,36812,38767,36845,36849,39181,39168,38671,39188,38490,36961,39201,36717,38382,38070,37868,38984,36770,38981,38494,36807,38885,36759,36857,38924,39038,38888,38876,36879,37897,36534,36764,37931,38254,39030,38990,37909,38982,38290,36848,37857,37923,38249,38658,38383,36813,36765,36817,37263,36769,37869,38183,36861,38206,39031,36800,36788,36972,38508,38303,39051,38491,38983,38759,36740,37958,36967,37930,39174,39182,36806,36867,36855,39222,37862,36752,38242,37965,38894,38182,37922,37918,36814,36872,38886,36860,36527,38194,38975,36718,39224,37436,39032";

    echo(strlen($test_str)); echo("<br>");

    echo(strlen(base64_encode(gzcompress($test_str,9)))); echo("<br>");

    echo(strlen(bin2hex(gzcompress($test_str, 9)))); echo("<br>");

    echo(strlen(urlencode(gzcompress($test_str, 9)))); echo("<br>");

    echo(strlen(rtrim(strtr(base64_encode(gzdeflate($test_str, 9)), '+/', '-_'), '=')));
?>

Here are the results:

1799  (original string length)
928   (base64_encode + gzcompress: 51.6% of original)
1388  (bin2hex + gzcompress)
1712  (urlencode + gzcompress)
918   (URL-safe Base64 + gzdeflate: 51.0% of original)

Results are comparable for base64_encode with gzcompress and base64_encode with gzdeflate (plus some string translations); gzdeflate comes out slightly smaller.

Madhur Bhaiya

base64_encode makes the string unreadable (while of course easily decodable) but blows the volume up by 33%.

urlencode() turns any characters unsuitable for URLs into their URL-encoded counterparts. If your aim is to make the string work in the URL, this may be the right way for you.

If you have a session running, you could also consider putting the query string into a session variable with a random (small) number, and put that random number into the GET string. This method won't survive longer than the current session, of course.

Note that a GET string should never exceed 1-2 kilobytes in size due to server and browser limitations.
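A minimal sketch of the session idea (variable and key names are my own; on the receiving request the token would arrive in $_GET):

```php
<?php
// Stash the long query string server-side under a short random token,
// so the URL only has to carry the token.
session_start();

$longQuery = 'key=' . str_repeat('some_very_long_value_', 30);
$token = bin2hex(random_bytes(8)); // 16-character identifier (PHP 7+)
$_SESSION['query_' . $token] = $longQuery;

// The URL now carries only the short token.
$url = 'http://test.com/test.php?q=' . $token;

// On the receiving page, $token would come from $_GET['q']:
$restored = $_SESSION['query_' . $token] ?? null;
```

As noted in the comments above, using a unique random token per query string also avoids collisions between multiple instances.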

Pekka