
I have several short UTF-16 strings that I need to compress (each about 20-200 characters long).

The strings almost always contain only English letters and digits.

I could probably write a compressor myself that achieves roughly a 50% ratio.

Looking for an idea/implementation.

I'm using C#.

user972014
    I can convert it to UTF8 and achieve nearly 50% compression... :-) – xanatos May 09 '15 at 15:10
  • The point is, what do you want to do with these strings once compressed? A compressed string, or a string converted to UTF8 is a `byte[]`, so something that isn't very good for working on it. You can save it, load it, transmit it. – xanatos May 09 '15 at 15:12
  • See whether http://stackoverflow.com/a/7343623/613130 is what you want. .NET `string`s are UTF-16 strings. – xanatos May 09 '15 at 15:32
  • It's easy to use GZip, but it doesn't manage to compress these short strings very efficiently. I'm looking for a better method targeted at this specific problem. – user972014 May 09 '15 at 19:10
  • You could try Deflate... But if you are looking for a custom algorithm optimized for text, that is a different problem. – xanatos May 09 '15 at 19:11 (a sketch of this GZip/Deflate baseline follows these comments)
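
The following is not part of the thread: a minimal sketch of the baseline the comments describe, converting the UTF-16 `string` to UTF-8 and running it through `GZipStream`. The sample string and the `GzipBaseline` class name are made up. It illustrates why GZip struggles at these sizes: the format's fixed 10-byte header and 8-byte trailer are large relative to a 20-200 character input.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipBaseline
{
    static void Main()
    {
        // Hypothetical sample input: short English text with digits.
        string s = "Order 8841 shipped to 221B Baker Street, London";

        byte[] utf16 = Encoding.Unicode.GetBytes(s); // .NET strings are UTF-16
        byte[] utf8  = Encoding.UTF8.GetBytes(s);    // ~half the size for ASCII text

        byte[] gzipped;
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
                gz.Write(utf8, 0, utf8.Length);
            gzipped = ms.ToArray();
        }

        // GZip adds a 10-byte header and an 8-byte trailer (CRC32 + size),
        // so on inputs this short the output shrinks little or even grows.
        Console.WriteLine("UTF-16: {0} bytes, UTF-8: {1} bytes, UTF-8+GZip: {2} bytes",
                          utf16.Length, utf8.Length, gzipped.Length);
    }
}
```

Swapping `GZipStream` for `DeflateStream` removes that header and trailer entirely (it emits raw DEFLATE data), which is why the last comment and the answer below suggest Deflate for the longer strings.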

1 Answer

  1. Use UTF-8. It gives you the 50% you asked for.
  2. You can squeeze out a little more by exploiting the fact that for English text the high bit of almost every UTF-8 byte is zero, so 8 characters fit into 7 bytes (see the packing sketch after this list).
  3. You can then apply a shared, pre-computed Huffman tree to take advantage of the letter distribution.
  4. For strings that are quite long (say >100 chars) Deflate or something like it starts to become effective. I'd apply it after converting to UTF-8 (also shown in the sketch below).
  5. If you are willing to use a shared dictionary you can achieve a lot more compression. That dictionary would need to be pre-computed over the entire corpus and shared between compressor and decompressor.
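
Not part of the original answer, but a minimal sketch of steps 1, 2 and 4, assuming the strings really are ASCII-only; the `ShortStringCompression` class and its method names are invented for illustration:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class ShortStringCompression
{
    // Step 2: for ASCII-only text every byte's high bit is zero, so
    // 8 characters pack into 7 bytes (a further 12.5% saving on top
    // of the UTF-16 -> UTF-8 halving).
    public static byte[] PackAscii7(string s)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(s); // non-ASCII chars become '?'
        byte[] packed = new byte[(ascii.Length * 7 + 7) / 8];
        int bitPos = 0;
        foreach (byte b in ascii)
            for (int i = 6; i >= 0; i--, bitPos++)   // 7 bits per char, MSB first
                if ((b & (1 << i)) != 0)
                    packed[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
        return packed;
    }

    // The packed form has no terminator, so the decoder needs the
    // original character count.
    public static string UnpackAscii7(byte[] packed, int charCount)
    {
        byte[] ascii = new byte[charCount];
        int bitPos = 0;
        for (int c = 0; c < charCount; c++)
            for (int i = 6; i >= 0; i--, bitPos++)
                if ((packed[bitPos / 8] & (0x80 >> (bitPos % 8))) != 0)
                    ascii[c] |= (byte)(1 << i);
        return Encoding.ASCII.GetString(ascii);
    }

    // Step 4: for the longer strings, raw Deflate (no GZip framing)
    // over the UTF-8 bytes starts to pay off.
    public static byte[] DeflateUtf8(string s)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        using (var ms = new MemoryStream())
        {
            using (var ds = new DeflateStream(ms, CompressionMode.Compress))
                ds.Write(utf8, 0, utf8.Length);
            return ms.ToArray();
        }
    }
}
```

Round-tripping is `UnpackAscii7(PackAscii7(s), s.Length)`, so the character count must be stored alongside the payload. Steps 3 and 5 would replace the fixed 7-bit code with variable-length codes and a dictionary learned from the shared corpus.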
usr