8

I want to take an arbitrary string of ASCII text, like "Hello world", and compress it into a version with fewer characters (as few as possible), but in a way that it can be decompressed. The compressed version should be composed only of ascii characters. Is there a way to accomplish this, especially in Ruby?

dan
  • 43,914
  • 47
  • 153
  • 254
  • You want to compress text in an alphabet consisting of `a-zA-Z` into an alphabet consisting of `a-zA-Z`? I don't think that's possible. To reduce the length, you need to increase the available characters. If your input alphabet is limited to, say, `a-zA-Z` and your output can contain all 255 ASCII codepoints, you may be on to something... – deceze Jan 27 '11 at 03:04
  • When you say that the compressed version should be composed only of ascii characters, do you mean that 0x00-0x19 chars are not allowed? If you take down the possible chars to A-Za-z0-9, you might be able to get 5 chars / int. But it won't be an ascii string anymore, though – Waneck Jan 27 '11 at 03:07
  • @deceze If it couldn't be done, binary files could not be compressed (as thay are already 8 bits). It can be done, but you'll get an output shorter than the input only if you have (a sizable number of) repetitions and so a dictionary helps. – Dr. belisarius Jan 27 '11 at 03:41
  • @belisarius Yes, you're right. Point accepted. It *is* possible, just hardly so with strings as short as "Hello world". – deceze Jan 27 '11 at 03:54
  • 1
    @deceze The cutoff point when your repetitions start to gain some benefit for you depends on the compression algorithm. But one thing is sure: you need repetitions comprising more than one char ... and _Hello world_ is not the most beautiful example ... – Dr. belisarius Jan 27 '11 at 04:00

5 Answers5

8

If you know that only ASCII characters will be used, that is the 7 low order bits of every byte. With bit manipulation, you can mash every 8 bytes into 7 (12.5% savings). If you can get it into a smaller range (64 valid chars only), you can drop another byte.

However, because you want the compressed form to ALSO contain only ASCII characters, that loses you one byte - which goes back to square one unless your input can be restricted to 64-chars (e.g. lossy compression substituting some chars with others, storing only in lower case etc).

If your strings are not large (>1k), then there is minimal savings to be had using gzip/bzip2 etc because of the size of the headers. If you had a predefined dictionary to use as a Huffman table, you may get some compression but in other cases, you can get bloat against the original text.

Prior discussion on SO An efficient compression algorithm for short text strings

Community
  • 1
  • 1
RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
  • 2
    The question, as I read it, is about compressing text where the output is also 7-bit ASCII. Removing the high bit isn't going to work as compression in that case. – Nick Johnson Jan 27 '11 at 03:29
4

There are many good text compression algorithms like Huffman encoding or LZW that are good at compressing text strings into bitstrings with many fewer bits than the standard ASCII encoding. Once you have such an encoding, you can always split the bitstring up into groups of seven bits to pack them into standard ASCII characters. I'm sure that there are libraries out there that do this, but I'm not much of a Ruby coder and don't know of any off the top of my head.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
1

The simplest way to do this would be to compress it using a standard algorithm, then base64 encode the result. This isn't likely to help on a string as short as 'Hello world', though - at that size, there's very little you could do to decrease the size, unless all your strings have a similar restricted character set, or patterns that something like huffman encoding can take advantage of.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
0

If your language is given, say english, then you can get away by leaving common characters out if your word stays unambiguous. For example, "Hello world" could become "Hll wrld" if your dictionary only contains Hello to match Hll and world to match wrld. Semitic languages like arabic actually have no vocals in their written language, and people still manage to read them. Also, other rules like when a word is supposed to be uppercase can be used to reduce the character set to lower case characters(supposing that a given text follows these rules).

Also, while byte-wise compression works well for texts, actual natural language can be far better compressed if you encode whole words, because the vocabulary size is very limited (even more limited if you look at a restricted set of texts). But that was not the question, i'm getting off-topic here.

kutschkem
  • 7,826
  • 3
  • 21
  • 56
0

ascii to ascii compression

generally, see Lempel Ziv compression

you may have additional constraints for the output format, such as ...

  • git friendly / vcs friendly: compressor should reuse symbol table of previous compression, to minimize diff between compressions
  • append only: compressor can append new symbol table items to the file body (not only in the file header)
milahu
  • 2,447
  • 1
  • 18
  • 25