
I have several short UTF-16 strings that I need to compress (each about 20-200 characters long).

The strings almost always contain only English letters and digits.

I could probably write a compressor myself that achieves roughly a 50% ratio.

Looking for an idea/implementation.

I'm using C#.

user972014
    I can convert it to UTF8 and achieve nearly 50% compression... :-) – xanatos May 09 '15 at 15:10
  • The point is, what do you want to do with these strings once compressed? A compressed string, or a string converted to UTF8 is a `byte[]`, so something that isn't very good for working on it. You can save it, load it, transmit it. – xanatos May 09 '15 at 15:12
  • See whether http://stackoverflow.com/a/7343623/613130 is what you want. .NET `string`s are UTF-16 strings. – xanatos May 09 '15 at 15:32
  • It's easy to use GZip, but it doesn't manage to compress these short strings very efficiently. I'm looking for a better method targeted at this specific problem. – user972014 May 09 '15 at 19:10
  • You could try Deflate... But if you are looking for a custom algorithm optimized for text, that is a different problem. – xanatos May 09 '15 at 19:11 (a sketch of this GZip/Deflate baseline follows these comments)
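
The following is not part of the thread: a minimal sketch of the baseline the comments describe, converting the UTF-16 `string` to UTF-8 and running it through `GZipStream`. The sample string and the `GzipBaseline` class name are made up. It illustrates why GZip struggles at these sizes: the format's fixed 10-byte header and 8-byte trailer are large relative to a 20-200 character input.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipBaseline
{
    static void Main()
    {
        // Hypothetical sample input: short English text with digits.
        string s = "Order 8841 shipped to 221B Baker Street, London";

        byte[] utf16 = Encoding.Unicode.GetBytes(s); // .NET strings are UTF-16
        byte[] utf8  = Encoding.UTF8.GetBytes(s);    // ~half the size for ASCII text

        byte[] gzipped;
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
                gz.Write(utf8, 0, utf8.Length);
            gzipped = ms.ToArray();
        }

        // GZip adds a 10-byte header and an 8-byte trailer (CRC32 + size),
        // so on inputs this short the output shrinks little or even grows.
        Console.WriteLine("UTF-16: {0} bytes, UTF-8: {1} bytes, UTF-8+GZip: {2} bytes",
                          utf16.Length, utf8.Length, gzipped.Length);
    }
}
```

Swapping `GZipStream` for `DeflateStream` removes that header and trailer entirely (it emits raw DEFLATE data), which is why the last comment and the answer below suggest Deflate for the longer strings.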

1 Answer

  1. Use UTF-8. It gives you the 50% you asked for.
  2. You can squeeze out a little more by exploiting the fact that for English text the high bit of almost every UTF-8 byte is zero, so 8 characters fit into 7 bytes (see the packing sketch after this list).
  3. You can then apply a shared, pre-computed Huffman tree to take advantage of the letter distribution.
  4. For strings that are quite long (say >100 chars) Deflate or something like it starts to become effective. I'd apply it after converting to UTF-8 (also shown in the sketch below).
  5. If you are willing to use a shared dictionary you can achieve a lot more compression. That dictionary would need to be pre-computed over the entire corpus and shared between compressor and decompressor.
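
Not part of the original answer, but a minimal sketch of steps 1, 2 and 4, assuming the strings really are ASCII-only; the `ShortStringCompression` class and its method names are invented for illustration:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class ShortStringCompression
{
    // Step 2: for ASCII-only text every byte's high bit is zero, so
    // 8 characters pack into 7 bytes (a further 12.5% saving on top
    // of the UTF-16 -> UTF-8 halving).
    public static byte[] PackAscii7(string s)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(s); // non-ASCII chars become '?'
        byte[] packed = new byte[(ascii.Length * 7 + 7) / 8];
        int bitPos = 0;
        foreach (byte b in ascii)
            for (int i = 6; i >= 0; i--, bitPos++)   // 7 bits per char, MSB first
                if ((b & (1 << i)) != 0)
                    packed[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
        return packed;
    }

    // The packed form has no terminator, so the decoder needs the
    // original character count.
    public static string UnpackAscii7(byte[] packed, int charCount)
    {
        byte[] ascii = new byte[charCount];
        int bitPos = 0;
        for (int c = 0; c < charCount; c++)
            for (int i = 6; i >= 0; i--, bitPos++)
                if ((packed[bitPos / 8] & (0x80 >> (bitPos % 8))) != 0)
                    ascii[c] |= (byte)(1 << i);
        return Encoding.ASCII.GetString(ascii);
    }

    // Step 4: for the longer strings, raw Deflate (no GZip framing)
    // over the UTF-8 bytes starts to pay off.
    public static byte[] DeflateUtf8(string s)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(s);
        using (var ms = new MemoryStream())
        {
            using (var ds = new DeflateStream(ms, CompressionMode.Compress))
                ds.Write(utf8, 0, utf8.Length);
            return ms.ToArray();
        }
    }
}
```

Round-tripping is `UnpackAscii7(PackAscii7(s), s.Length)`, so the character count must be stored alongside the payload. Steps 3 and 5 would replace the fixed 7-bit code with variable-length codes and a dictionary learned from the shared corpus.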
usr