I have millions of strings, each shorter than 20 characters, and I want to compress each of them individually. Using zlib or lz4 on each string by itself doesn't work: the output is larger than the input:
import zlib

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

for s in inputs:
    c = zlib.compress(s)
    print(c, len(c), len(s))  # the output is larger than the input
Is there a way in Python (maybe with zlib or lz4?) to use dictionary-based compression with a custom dictionary size (for example 64 KB or 1 MB), so that very short strings can be compressed individually against a shared dictionary? Something like this hypothetical API:
inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

D = DictionaryCompressor(dictionary_size=1_000_000)
for s in inputs:
    D.update(s)
# now the dictionary is ready
for s in inputs:
    print(D.compress(s))
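The closest I've gotten myself is zlib's zdict parameter on compressobj/decompressobj, which takes a preset dictionary. Here is a minimal sketch, assuming I build the dictionary by hand (the zdict bytes below are made up, and zlib caps the dictionary at its 32 KB deflate window, so 64 KB or 1 MB wouldn't fit):

import zlib

# Hand-picked dictionary; in practice it would have to be built from
# frequent substrings of the real data. zlib limits it to the 32 KB
# deflate window.
zdict = b"hello world foo bar 12345"

inputs = [b"hello world", b"foo bar", b"HELLO foo bar world", b"bar foo 1234", b"12345 barfoo"]

for s in inputs:
    # wbits=-15 produces a raw deflate stream, dropping the zlib
    # header and checksum that dominate the output for tiny inputs
    co = zlib.compressobj(level=9, wbits=-15, zdict=zdict)
    c = co.compress(s) + co.flush()
    do = zlib.decompressobj(wbits=-15, zdict=zdict)
    assert do.decompress(c) + do.flush() == s
    print(len(c), len(s))

This does shrink the strings, but only because I hand-picked the dictionary: nothing is learned from the data, and the 32 KB cap rules out the dictionary sizes I want.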
Note: "Smaz" looks promising, but it is very much hard-coded and not adaptive: https://github.com/antirez/smaz/blob/master/smaz.c