7

The problem:

  • We need to serilize an array into a short string, the shorter the better.
  • The importance is more reflected on the bigger array than the smaller one.
  • The string will be used in a get request so it has to be url decoded.

Current code snippet

  /* 
    array (size=3)
      0 => string '/js/dhdbm78hdfb.js' (length=18)
      1 => string '/js/dfg4dg.js' (length=13)
      2 => string '/js/fg8hfhrt.js' (length=15)
      2 => string '/js/yjtdygj.js' (length=14)
      2 => string '/js/2q5g54rvfd.js' (length=17)
   */

  $json = json_encode($data);
  $gz = gzdeflate($json, 9);
  $base64 = base64_encode($gz);
  $paths = urlencode($base64);

  // outputs: i1aK0c8qjtFPyUhJyjW3yEhJS9LLKlbSgQmnpZukpCOLpKVbZKRlFJUgi1VmlaRUpmchCxkVmqabmhSVpaWARGMB

Not very impressive and pretty slow, I think there should be a better way of doing this...

Question

What would be the best approach to this problem? How can we present the smallest string possible?

PS

It's not the biggest issue if it's slow but it's a variable to be considered. The array will be hashed and retrived from cache when possible.

superhero
  • 6,281
  • 11
  • 59
  • 91
  • 1
    Extract the unique part of the url and then concatenate it with a seperator not used in the strings. Example: `dhdbm78hdfb:dfg4dg:fg8hfhrt:yjtdygj:2q5g54rvfd` – datasage Mar 27 '13 at 18:22
  • 1
    Algorithm being _fast_ and the output being _small_ at the same time is not possible. – Salman A Mar 27 '13 at 18:26
  • @datasage The reason I'm doing it like this is because the array is sometime multidimensional. Also, the strings in array could be of any sort. The example above is just a dummy. But I believe such an implementation you talk about could still be a good way to go. – superhero Mar 27 '13 at 18:28
  • @SalmanA As stated in the PS, it should only be considered.. it is not the biggest issue. Though a valid point :) – superhero Mar 27 '13 at 18:29
  • @ErikLandvall You will probably have a hard time getting much compression out of gzcompress/deflate. Especially if you have to turn around and base 64 encode it. If you have a predictable format, you may be able to serialize it by removing data and re-adding it later. – datasage Mar 27 '13 at 18:32

3 Answers3

2

If you want it as fast and as small as possible, ditch the general purpose tools and use a method suited to the particular data you're sending. Strip out anything duplicated between all of the array values (e.g. "/js/" and ".js"), strip out all of the array syntax from serialization, and just send a string concatenated list of your unique values, and reconstitute it on the receiving end.

Further compress it with gz and base64_encode if you must, but one of the drawbacks of having unique data is that it must be uniquely represented.

See the selected answer here: How to compress/decompress a long query string in PHP?


Edit: do you have the leeway to POST the data instead? That will at least avoid the direct size restrictions associated with the querystring. It'll also avoid the necessity to encode the string for the URL, adding length, and you should be able to send the zipped content directly, for most savings.

Community
  • 1
  • 1
Jason
  • 13,606
  • 2
  • 29
  • 40
1
serialization = json_encode
compression = gzdeflate
urlsafe = base64_encode

urlencode isn't needed as base64_encode produces safe characters only. I'd keep the base64_encode in there. You could consider finding a faster serializer or a faster compressor. You could even leave out the compressor if your input is short enough.


Algorithms have performance characteristics in two dimensions: number of instructions and memory usage. Typically you'll find that if you try to reduce one the other goes up. Indexes in a database allow for fast finding but take up a lot of space. Index-less finding is slow but doesn't require extra space for indexes.

Since you want short that means you need to make sacrifices in terms of fast.


Honestly, I think you've got a pretty optimal set here.

Halcyon
  • 57,230
  • 10
  • 89
  • 128
  • A max of 2 `=`. In general base64 adds 33% overhead. I think that's reasonable. You could write a transcriber that uses more than 64 characters but you'd have to find enough URL-safe characters and you might lose performance if you need to make your own implementation vs. native. – Halcyon Mar 27 '13 at 18:26
-1

Try to use gzcompress() function

http://php.net/manual/en/function.gzcompress.php