When using multiple strings like this, you can avoid the per-string pointer overhead (4 or 8 bytes each) by concatenating the strings with \0 separators (1 byte each) and using a lookup function.
#include <stdio.h>

static const char strings[] = "hello\0world\0test";

/* Return a pointer to the nth string (0-based) in the packed table. */
static const char *nthstring(const char *s, unsigned n)
{
    while (n--)          /* skip n strings */
        while (*s++)     /* scan past each terminating '\0' */
            ;
    return s;
}

int main(void)
{
    printf("%s\n", nthstring(strings, 1));
    return 0;
}
However, if every string is shorter than UCHAR_MAX, you can speed up the lookup by reusing the zero-byte placeholders to store lengths (plus one extra length byte at the beginning). This costs only 1 additional data byte but eliminates many conditional jumps and increments in the lookup function.
#include <stdio.h>

/* each "string" is prefixed with its length, written as an octal escape */
static const char lenstrings[] = "\05hello\05world\04test";

static const char *ithstring(const char *s, unsigned n)
{
    while (n--)
        s += (unsigned char)*s + 1;   /* skip length byte plus payload */
    return s;
}

int main(void)
{
    const char *s = ithstring(lenstrings, 1);
    /* pass the length explicitly because there is no terminating '\0' */
    printf("%.*s\n", (int)(unsigned char)*s, s + 1);
    /* write(1, s + 1, (unsigned char)*s);  POSIX variation via <unistd.h> */
    return 0;
}
For both variations it is best to place the most frequently needed strings first. The second method additionally lets you store compressed data (pick whichever compressor works best for your data; David Cary's answer lists workable options), as long as the length prefixes hold the compressed lengths.
Note: To get the maximum compression out of standard compressors, you will likely want to shrink the length field in their headers to an unsigned char (or unsigned short if string lengths exceed 255 but not 65535 bytes), since most compressors are designed to handle large files. This can save 3-7 bytes per string.