4

I'm looking for an algorithm that would compress some string to another string (i.e. without "\0" or special control characters), but I can't find anything on the internet. Is there such an algorithm? It doesn't have to be particularly efficient, just something basic.

laurent
  • 88,262
  • 77
  • 290
  • 428
  • see http://stackoverflow.com/questions/1138345/best-compression-algorithm-for-short-text-strings – default locale Sep 21 '11 at 10:57
  • 5
    Any compression algorithm does exactly that. – static_rtti Sep 21 '11 at 10:59
  • @MAKKAM, this is not really related. He wants an algorithm for a specific kind of string (short strings) while I need something general. – laurent Sep 21 '11 at 10:59
  • 1
    @static_rtti, I'm looking for a compression algorithm that would output a string that can be copied and pasted. I think most compression algorithms output binary data with `\0` characters. – laurent Sep 21 '11 at 11:29
  • @Laurent: Nothing wrong with strings containing `0x0` in my corner of the universe, they're still strings (that your OS can't copy and paste them is ... rather unfortunate, but not a law of nature). Are you asking about a) specific language, b) you need to have a specific set of characters in the output? – Piskvor left the building Sep 21 '11 at 11:59
  • @Piskvor, yes I guess my question is that I want an output that only contains printable characters, so that it can be copied and pasted, sent by email, etc. I don't think it's just an OS thing, many UI and programs will have trouble copying and pasting strings with `0x0`. – laurent Sep 21 '11 at 12:17
  • try [Huffman coding](http://en.wikipedia.org/wiki/Huffman_coding) and [RLE](http://en.wikipedia.org/wiki/Run-length_encoding) – Aman Agarwal Sep 21 '11 at 11:20
  • You could try [Smaz](https://github.com/antirez/smaz), or adapt it to your exact needs. – Simon Mourier Sep 21 '11 at 12:07

4 Answers4

8

Easy:

$ echo "Hello world" | gzip -c | base64
H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=

$ echo "H4sIALnHeU4AA/NIzcnJVyjPL8pJ4QIA1eA5twwAAAA=" | base64 -d | gzip -dc
Hello world

Note: looks like there is no compression, but for bigger data the compression ratio will be better :-)

Tomas
  • 57,621
  • 49
  • 238
  • 373
  • base64 is not as efficient as possible under the constraints the OP mentioned, there are over 90 printable ASCII characters and even more in extended-ascii – harold Sep 21 '11 at 11:45
  • 2
    Establishes the principle, though. You could substitute Ascii85, or some other scheme using even more printable characters, and the questioner says, "It doesn't have to be particularly efficient". – Steve Jessop Sep 21 '11 at 11:51
  • 2
    @harold, yes, might not be as space efficient as possible, but the OP said *"It doesn't have to be particularly efficient, just something basic"* - and that's it! Base64 is very basic, easy and portable. It is ready-to-use in many languages. And there's no need to implement anything else for yourself, no need to reinvent the wheel. – Tomas Sep 21 '11 at 11:52
  • does it have to be particularly inefficient? ok it's not that bad, but it kind of hurts my "it could have been better" senses. If there wasn't a compression stage first it would have made sense to me to just base64 it, but now not so much. – harold Sep 21 '11 at 12:10
  • I tried this, and had to add `--ignore-garbage` to `base64 -d` – Stephen Denne Sep 21 '11 at 22:09
  • @Stephen: I don't need to do that on either Ubuntu or Cygwin, but rather than `base64 -i`, maybe try `echo -n` (to omit the newline). `base64` is *supposed* to ignore newlines, though. – Steve Jessop Sep 21 '11 at 22:13
  • Not wrapping worked too: `man base64 | gzip -c | base64 -w0 | base64 -d | gzip -dc` – Stephen Denne Sep 21 '11 at 22:17
  • @Stephen: Hmm, `man base64 | gzip -c | base64 | base64 -d | gzip -dc` (that is, no `-w0`) also works for me on both. The output of `base64` should be valid input to `base64 -d`, regardless of options, but for you it seems to fail to ignore newlines without `-i`. Odd. – Steve Jessop Sep 21 '11 at 22:21
  • I'll have to give that a try but I'm not sure encoding it in base64 actually achieves any sorts of compression. – laurent Sep 23 '11 at 05:08
3

Apparently you have some specific character set in mind and you want to use it for both the original string and the compressed string.

Standard compression routines (e.g. gzip) work on byte strings.

One idea is to take existing code (e.g. gzip's) and rewrite it to use your character set instead of bytes.

Another is to construct a 1-to-1 mapping between strings in your character set and arbitrary byte strings, map the original string to a byte string, compress the byte string using a standard compression utility or function, and map the result back to a string using your character set. (Strictly speaking you can use two different mappings.)

One way to construct the mapping is to pad your character set with dummies and a special pad character until you have 2^k different characters (for some k); then each 8 of your characters correspond to k bytes (and shorter strings can be padded with the pad character).

reinierpost
  • 8,425
  • 1
  • 38
  • 70
3

Your requirement for no "special characters" is very restrictive, unless you can guarantee that a subset of characters (say "~") will never be used. Then you can use those characters to mark your compression:

~a -> the
~b -> The
~c -> and
~d -> And
~e -> Sirius Robotics Corporation Ltd.
etc.

Just add commonly used words to the codebook. The codebook can be fixed, as above, or vary with the text to be compressed. Either way the decompressing side will need access to the correct codebook to do the decompression.

rossum
  • 15,344
  • 1
  • 24
  • 38
1

As far as I can tell, the most popular compression algorithm that allows standard C string-handling routines to be re-used to handle compressed text (i.e., carefully avoids putting any 0x00 bytes in the compressed string, except as the end-of-compressed-data marker) is simple byte-pair encoding, also called dual-tile encoding or DTE. DTE is often used to compress text in video game ROMs.

When the DTE decompressor prints out a DTE-compressed string, it reads 1 byte at a time from the DTE-compressed string and prints out 1 or two bytes:

  • compressed byte B in the range 0x01..0xFF: the decoder uses that as an index into the "dictionary" and prints out the 1 or 2 bytes stored in the dictionary at that index.
  • compressed byte B is 0x00, that's the end of the string -- done.

A typical DTE implementation has a hard-wired dictionary stored in both the encoder and the decoder something like this:

  • indexes of frequently-used letters -- perhaps the entire ASCII isprint() range 0x20 to 0x7e, and the newline character 0x0A -- represent themselves. (The compressed byte 'a' is decoded as the single letter 'a')
  • indexes from 0xc0 to 0xff: the byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80. (The compressed byte (0x80 xor 'a') is decoded into 2 characters, the space character and the letter 'a').
  • Any other available indexes ( 0x7f..0xbf ) store other common bigrams ("th", "re", etc.).
David Cary
  • 5,250
  • 6
  • 53
  • 66