Compress a short but repeating string

Question

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:

http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789

I want to shorten that string:

http://site.com/app?things=somekindofencoding

The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.

I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.

Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.

is `/stuff/things/item*` always the same ? If yes, why don't you simply pass `123,456,789` ? — Steve B, Jun 12 '12 at 09:14
what are the constraints on the data? can they be expressed with (e)BNF? Ie is the data regular in format? — Rune FS, Jun 12 '12 at 09:14
@SteveB not always, but one call might be lots of `/stuff/things/item` and another `/files/item` or `/items/thing` or whatever. — Keith, Jun 12 '12 at 09:19
if you know in advance the kind of arguments, you can use something like : `http://site.com/app?things=123,456,789&files=ABC,DEF` In fact, you should describe what kind of data will the url contain. — Steve B, Jun 12 '12 at 09:22
@RuneFS The data is roughly file names, so `a-zA-Z0-9_-,.~/\` — Keith, Jun 12 '12 at 09:22
http://stackoverflow.com/questions/1192732/really-simple-short-string-compression — duedl0r, Jun 12 '12 at 11:31
@duedl0r cheers, but the answers there are mostly along the lines of GZipped compression actually being longer for small amounts of data or suggesting the use of a DB/dictionary, both of which I cover in the question. — Keith, Jun 12 '12 at 12:11
"I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL." I don't understand this. You must be writing the code on both ends. Then you simply make the information needed for compression and decompression available and identical at both ends. — Mark Adler, Jun 12 '12 at 14:23
@MarkAdler I am writing the code at both ends, but they don't share an application instance (ruling out a static `Dictionary`) or a common network (ruling out a database). The problem is that the information needed at both ends is unknown in advance - it needs to be passed as part of the compression. — Keith, Jun 12 '12 at 14:37
The dictionary would be built into the code. If you can write code at both ends, then you can include information at both ends. It sounds like you're saying that there are no likely strings known a priori. — Mark Adler, Jun 12 '12 at 14:42
@MarkAdler yeah, I don't think I've explained that every well :-S I've no idea at all what will be in the list, but generally there is a lot of repetition. — Keith, Jun 12 '12 at 14:48
@Keith: I think the conclusion from the previous link is that you should not compress it with zip or whatever, because it's probably a bad idea, no? He mentions using a DB, but you said you don't want to use it. Maybe huffmann-coding helps? But IMO is too complicated for this problem.. — duedl0r, Jun 12 '12 at 14:57

Rune FS · Accepted Answer · 2012-06-12T11:19:11.290

1

If you can express the data in BNF you could contruct a parser for the data. in stead of sending the data you could send the AST where each node would be identified as one character (or several if you have a lot of different nodes). In your example

we could have

files : file files
      | 
file : path id
path : itemsthing
     | filesitem
     | stuffthingsitem

you could the represent a list of files as path[id1,id2,...,idn] using 0,1,2 for the paths and the input being:

/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7

you'd then end up with ?things=2[123,456,789]1[1,46,7]

where /stuff/things/item is represented with 2 and /files/item/ is represented with 1 each number within [...] is an id. so 2[123] would expand to /stuff/things/item123

EDIT The approach does not have to be static. If you have to discover the repeated items dynamically you can use the same approach and pass the map between identifier and token. in that case the above example would be

?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item

which if the grammar is this simple ofcourse would do better with

?things=/stuff/things/[123,456,789]/files/item[1,46,7]

compressing the repeated part to less than the unique value with such a short string is possible but will most likely have to be based on constraining the possible values or risk actually increasing the size when "compressing"

edited Jun 12 '12 at 11:19

answered Jun 12 '12 at 09:33

Rune FS

21,497
7
62
96

Ok, that looks promising, but it feels like re-inventing the wheel to write something like that from scratch. I also don't know (in advance) exactly what the repeated parts of the strings are so I'd need to figure that out dynamically too. Is there anything existing out there that does this? – Keith Jun 12 '12 at 09:54
Since it's simply a key-value mapping you could both argue that there's a not of implementations (of key-value maps) while arguing that the simplicity of the approach would warrant a framwork solution after all it's less than say 200 LOC – Rune FS Jun 12 '12 at 09:59
Your method is perfect if I know the patterns that will be passed, but unfortunately I can't be that certain - it might get something like `/stuff/things/X,item/stuff/Y,things/item/1`. I'm looking for something that identifies from that pattern that "stuff", "thing", "item" are repeated and could be substituted. – Keith Jun 12 '12 at 10:53
Your edit makes good sense, but I'm not sure how to apply it. I need to quickly figure out what the repeating patterns are and tokenise them ("stuff", "thing", "item" are just my made up example). So `/stuff/things/X,item/stuff/Y,things/item/1` would become something like: `/#a/#b/X,#c/#a/Y,#b/#c/1|1=stuff2=things3=item` (which is unfortunately longer :-S). – Keith Jun 12 '12 at 14:44
@Keith You could compress that to '/12[X]31[Y]23[1]1stuff2things3item' Which is 34 characters long or less than 5/6 of the original (with a convention of each term ending in / otherwise the length would be 37 which would be 6/7 of the original). If all terms are less than three characters and only used once then the org will be shorter but for any compression algorithm there will be those inputs that yield a larger output – Rune FS Jun 12 '12 at 17:45

score 0 · Answer 2 · answered Jun 12 '12 at 15:56

You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters and does look for and take advantage of repeated strings. I haven't tried it, but could also see if smaz works for your data.

I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.

Compress a short but repeating string

2 Answers2