2

I have millions of < 20 char strings, and I want to compress each of them individually.

Using zlib or lz4 on each string individually doesn't work: the output is bigger than the input:

import zlib

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
for s in inputs:
    c = zlib.compress(s)
    print(c, len(c), len(s))  # the output is larger than the input

Is there a way in Python (maybe with zlib or lz4?) to use a dictionary-based compression, with a custom dictionary size (for example 64 KB or 1 MB) that would allow compression of very short strings individually?

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
D = DictionaryCompressor(dictionary_size=1_000_000)
for s in inputs:
    D.update(s)
# now the dictionary is ready    
for s in inputs:
    print(D.compress(s))

Note: "Smaz" looks promising, but it is very much hard-coded and not adaptive: https://github.com/antirez/smaz/blob/master/smaz.c

Basj
  • Why can't you just use a for-loop? I think that any library function would just do the same. – Michael M. Oct 30 '22 at 14:12
  • Did you see https://stackoverflow.com/questions/1138345/an-efficient-compression-algorithm-for-short-text-strings , https://stackoverflow.com/questions/2011653/how-to-find-a-good-optimal-dictionary-for-zlib-setdictionary-when-processing-a , https://stackoverflow.com/questions/56189234/compression-of-short-strings ... some of them discussing the use of a custom dict with zlib? – Thierry Lathuille Oct 30 '22 at 14:19
  • @ThierryLathuille Yes, I have read a few of these questions/answers, but I haven't found a Python implementation that works for really short strings. Does Python zlib support custom dictionaries (and also, how to build them in Python?) https://docs.python.org/3/library/zlib.html – Basj Oct 30 '22 at 14:25
  • @MichaelM. I don't understand, I think we are not speaking about the same thing? The algorithm I'm looking for is definitely not just simple for loops. – Basj Oct 30 '22 at 14:26
  • @ThierryLathuille Smaz looks promising, but I'm looking for something less hard-coded and more "adaptive": https://github.com/antirez/smaz/blob/master/smaz.c – Basj Oct 30 '22 at 14:28

3 Answers

4

Python's zlib interface does in fact provide zdict parameters for compressobj and decompressobj, as of version 3.3 (released ten years ago).

You can provide up to a 32K dictionary to aid in compressing short strings. I would also recommend the use of raw deflate streams to minimize the size (wbits=-15).

You can construct the 32K dictionary in many ways. A good starting point would be to simply concatenate a few thousand of your short strings. See if that permits compression of your short strings. Test with strings that are not in your dictionary.
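For example, a rough sketch along those lines (the tiny corpus and the frequency heuristic here are only illustrative; in practice feed it a few thousand of your real strings):

import zlib
from collections import Counter

corpus = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

# Put the most frequent material at the end of the dictionary: matches closer to the
# data need shorter distance codes, so they cost slightly fewer bits to reference.
counts = Counter(corpus)
ordered = b" ".join(s for s, _ in reversed(counts.most_common()))
zdict = ordered[-32768:]  # zlib only uses the last 32 KiB of a longer dictionary

co = zlib.compressobj(wbits=-15, zdict=zdict)  # raw deflate stream, no header/trailer
c = co.compress(b"hello world") + co.flush()
print(len(c), len(b"hello world"))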

You can also try zstd which should perform better than zlib, and which also supports dictionaries. zstd also has code to help you generate dictionaries. You would need to write your own Python interface to zstd.

I have not tried this, but it may be possible to use zstd's dictionary generation to make a good dictionary for zlib's deflate.
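For reference, the third-party zstandard package on PyPI already wraps zstd's dictionary training; a rough sketch assuming that package (training can fail if the corpus is too small or too uniform, so the generated samples below are only illustrative):

import zstandard  # third-party binding: pip install zstandard

# Dictionary training wants a large, varied corpus of sample strings.
samples = [b"user %d logged in from host-%d" % (i, i % 50) for i in range(5000)]

dict_data = zstandard.train_dictionary(64 * 1024, samples)  # e.g. a 64 KiB dictionary

cctx = zstandard.ZstdCompressor(dict_data=dict_data)
dctx = zstandard.ZstdDecompressor(dict_data=dict_data)

c = cctx.compress(b"user 42 logged in from host-42")
print(len(c), dctx.decompress(c))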

Lastly, I would try to preprocess my short strings based on what I know about them. If there is a way to tokenize contents of the strings that you know will be there, then you will have already compressed them some.
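For example, a hedged sketch of such preprocessing (the token table is made up, and the scheme is only reversible if the reserved bytes never occur in the raw strings):

# Hypothetical token table: substrings known to recur, mapped to reserved single bytes.
TOKENS = {b"hello": b"\x01", b"world": b"\x02", b"foo": b"\x03", b"bar": b"\x04"}

def pretokenize(s: bytes) -> bytes:
    for token, code in TOKENS.items():
        s = s.replace(token, code)
    return s

print(pretokenize(b"hello world"))  # b'\x01 \x02' -- 3 bytes instead of 11, before any deflate step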

Mark Adler
  • Thanks for your answer @MarkAdler! [Here is](https://stackoverflow.com/a/74254326) some sample code based on this, it is promising indeed! Now the holy grail for me would be to be able to put these compressed strings in a Sqlite database, but still keep fast prefix-search: https://stackoverflow.com/questions/74199484/compress-a-database-but-keep-fast-prefix-search-with-index. Would you have a hint for this? TL;DR: are "compressed" and "indexed-fast-query" mutually exclusive? – Basj Oct 30 '22 at 15:29
1

Example code, as an addition to @MarkAdler's solution:

import zlib
zdict = b"abcdefghijklmnopqrstuvwxyzhelloworldfoobar"
inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]
for s in inputs:
    co = zlib.compressobj(wbits=-15, zdict=zdict)
    c = co.compress(s) + co.flush()
    print(c, len(c), len(c) / len(s))

Here the output is always smaller than the input:

b'\x03\xf3\x15\xc0\x02\x00' 6 0.5454545454545454
b'\x03\x92\n@\n\x00' 6 0.8571428571428571
b'\xf3p\xf5\xf1\xf1W\x00\xb2\x15\x80\x1c\x05\xb0\x04\x00' 15 0.7894736842105263
b'\x03"\x05 K\xc1\xd0\xc8\xd8\x04\x00' 11 0.9166666666666666
b'34261U\x002\x80\\\x00' 11 0.9166666666666666
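For completeness, decompression needs the same zdict and wbits; a sketch of the matching round trip:

import zlib

zdict = b"abcdefghijklmnopqrstuvwxyzhelloworldfoobar"

def decompress(c: bytes) -> bytes:
    do = zlib.decompressobj(wbits=-15, zdict=zdict)  # must match the compression side
    return do.decompress(c) + do.flush()

co = zlib.compressobj(wbits=-15, zdict=zdict)
c = co.compress(b"hello world") + co.flush()
print(decompress(c))  # b'hello world'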
Basj
0

Why not compress a list of all of the strings as a way of compressing each one of them?

Consider writing your own Python code for this purpose. It should not be hard to do, and it allows you any individual optimization you can imagine.

If you take a library for this purpose (e.g. pyshoco, https://pypi.org/project/pyshoco/#files, or any other one), you have to 'train' (i.e. adapt) it to your specific data anyway. So I suggest putting these steps together and writing your own compressor code which, as a first step, builds a dictionary of substrings along with the number of their occurrences across all your short strings, to see what makes sense to do next.
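For example, such a first step could count substring occurrences across the whole corpus (the substring lengths 3 to 6 below are an arbitrary choice):

from collections import Counter

strings = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

# Count every substring of length 3..6 over the corpus to find recurring material.
counts = Counter(
    s[i:i + n]
    for s in strings
    for n in range(3, 7)
    for i in range(len(s) - n + 1)
)
print(counts.most_common(5))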

The simplest compressor code would be to build a list out of a set containing all of your short strings, in order to 'compress' each of them 'individually' to its index in this list. Then compress the list itself and see how big the "dictionary" (i.e. the list, or an actual dictionary of index:string pairs) becomes.

It is so simple that it is probably hard to see it as the best approach to achieve the desired goal ...
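A bare-bones sketch of that index idea (the names are illustrative):

import zlib

strings = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

# "Compress" each string individually to its index in a deduplicated list ...
unique = sorted(set(strings))
index = {s: i for i, s in enumerate(unique)}
codes = [index[s] for s in strings]

# ... then compress the shared list itself once and see how big that "dictionary" becomes.
blob = zlib.compress(b"\x00".join(unique))
print(codes, len(blob))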

Claudio