
Background

I'm working on a project (Python 2.7) that generates a large amount of data over a long span of time (i.e. days to weeks). The amount of data generated over a week can be anywhere from roughly 50MB to 2GB. The embedded system that runs this project has a decent amount of RAM (i.e. 64GB), but no persistent storage (data is offloaded over the network at random intervals), and I'd like to keep the amount of memory used by this project to a minimum while holding onto the in-memory data as long as possible.


Problem

This program holds onto a massive in-memory list of string data, and I need to get the memory usage down. This data is not read frequently (i.e. roughly once per day), but is updated frequently (i.e. roughly 1MB of data added at random intervals). I'm considering compressing it in memory, since it's just a bunch of human-readable strings of ASCII-formatted text, so that I could hold onto the data for a longer period of time before having to prune/delete the "oldest" entries.


Question

Is there a way, using standard/built-in Python functionality (i.e. no third party modules permitted), to:

  • Create an in-memory, compressed list (to reduce the in-memory size of this massive list-of-strings).
  • Allow for individual list entries to be extracted on-demand from the list.
  • Allow for deleting individual entries (i.e. the "oldest" entries if the compressed in-memory list grows too large).

Work So Far

  • "since it's just a bunch of human-readable strings of ASCII-formatted text" what sort of text? Repetitions of words that could be indexed, maybe? – Jblasco Sep 10 '18 at 15:25
  • @Jblasco Mainly logging data. It would lend itself well to solid compression, but is too varied/complex for a hand-written/one-off dictionary replacement algorithm. – Cloud Sep 10 '18 at 15:39
  • Why not just use `sqlite3`? It can be configured to create a `:memory:` temporary database, and it'll back stuff to disk as needed. – Martijn Pieters Sep 10 '18 at 15:41
  • Is this a huge list of small strings, or a short list of huge strings? – abarnert Sep 10 '18 at 15:43

1 Answer


Assuming you have a huge list of small strings, there’s nothing that’s going to work out of the box, but it’s not too hard to build something that will do what you want.

zlib isn’t great at compressing individual short strings; its per-string framing overhead tends to swamp any savings. But if you compress everything into one giant zlib stream, you lose random access.
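For example (an illustrative, made-up log line; exact sizes depend on the input), compressing one short string on its own usually saves nothing, because zlib’s framing overhead is comparable to the string itself:

import zlib

line = b"2018-09-10 15:25:01 INFO worker-3: heartbeat ok"
print(len(line))                 # 47
print(len(zlib.compress(line)))  # typically no smaller, often slightly larger:
                                 # a few bytes of zlib framing plus almost no
                                 # redundancy to exploit in such a short string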

So, without getting under the covers of the zlib stream, what can you do?

Chunk your strings up into mid-sized chunks that are big enough to be worth compressing but small enough to decompress on the fly. The number of strings per chunk is a tunable parameter that you’ll need to benchmark with different values.

The idea is something like this (a rough sketch):

import zlib
from collections.abc import Sequence
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # itertools-recipes grouper: collect data into fixed-length groups
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

class CompressedStringList(Sequence):
    def __init__(self, iterable, chunksize):
        self.chunksize = chunksize
        self.length = 0
        self.chunks = []
        for group in grouper(iterable, chunksize):
            # drop the padding grouper adds to the final, short group
            group = [s for s in group if s is not None]
            chunk = '\0'.join(group)
            self.chunks.append(zlib.compress(chunk.encode('ascii')))
            self.length += len(group)
    def __len__(self):
        return self.length
    def __getitem__(self, index):
        # TODO: add handling for negative indices and slices
        chunk, offset = divmod(index, self.chunksize)
        group = zlib.decompress(self.chunks[chunk]).decode('ascii')
        return group.split('\0')[offset]
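Used as a plain read-only sequence (a made-up example, just to show the shape of the API):

lines = ['log line %d: something happened' % i for i in range(10000)]
cl = CompressedStringList(lines, chunksize=1000)
print(len(cl))   # 10000
print(cl[4321])  # 'log line 4321: something happened'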

Depending on how you want to use this, you may want to wrap a small LRU cache (maybe even with maxsize of 1 or 2) around the zlib.decompress bit. That way, if you try to iterate the list or slice it, you’ll reuse the same decompressed chunk over and over instead of decompressing it over and over. But if you truly are doing just random access, the cache would be a small overhead for no gain.
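One way to bolt that on, assuming Python 3’s functools.lru_cache (the subclass and method names here are just illustrative):

import functools
import zlib

class CachedCompressedStringList(CompressedStringList):
    # Cache the last couple of decompressed chunks so iterating or slicing
    # reuses them; the cache is keyed on (self, chunk index) and shared
    # across instances, which is fine for a sketch like this.
    @functools.lru_cache(maxsize=2)
    def _decompressed(self, chunk_index):
        return zlib.decompress(self.chunks[chunk_index]).decode('ascii')

    def __getitem__(self, index):
        chunk, offset = divmod(index, self.chunksize)
        return self._decompressed(chunk).split('\0')[offset]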

To delete values, expanding this into a MutableSequence isn’t trivial, because you don’t want to decompress, rechunk, and recompress everything.

But it’s still not hard if, as your mention of throwing out the oldest values implies, you generally only delete from the left. Store an offset: the number of entries logically deleted from the first chunk (always less than chunksize). When you delete the leftmost element, don’t touch the compressed data; just increment the offset, and once it reaches chunksize, del self.chunks[0] and reset it to zero. Then, in __getitem__, compute divmod(index + self.offset, self.chunksize) instead of divmod(index, self.chunksize). You might also want to use a collections.deque in place of a list for your chunks, so dropping the leftmost chunk is cheap.
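A minimal sketch of that left-only deletion, again building on the class above (popleft is a name I made up; the pruning policy is up to you):

import zlib
from collections import deque

class PruningCompressedStringList(CompressedStringList):
    def __init__(self, iterable, chunksize):
        super().__init__(iterable, chunksize)
        self.chunks = deque(self.chunks)  # cheap deletion from the left
        self.offset = 0                   # entries logically deleted from chunk 0

    def popleft(self):
        # Logically drop the oldest entry; only free real memory once an
        # entire chunk's worth of entries has been discarded.
        self.offset += 1
        self.length -= 1
        if self.offset == self.chunksize:
            self.chunks.popleft()
            self.offset = 0

    def __getitem__(self, index):
        # Shift by the number of logically deleted leading entries.
        chunk, offset = divmod(index + self.offset, self.chunksize)
        group = zlib.decompress(self.chunks[chunk]).decode('ascii')
        return group.split('\0')[offset]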

If you need to delete with random access, each chunk can keep its own count of live entries, so you only have to decompress and recompress a single chunk when you delete from it. But then you can no longer just use divmod to find the chunk; you need to keep an index of cumulative counts per chunk so you can bisect to the right chunk for any logical index.
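A sketch of that lookup with made-up per-chunk counts (keeping the cumulative totals up to date after each delete, and skipping over any holes inside a chunk, is left as bookkeeping):

import bisect
from itertools import accumulate

counts = [1000, 997, 1000, 640]        # live strings remaining per chunk
cumulative = list(accumulate(counts))  # [1000, 1997, 2997, 3637]

def locate(index):
    # Map a logical index to (chunk number, offset among that chunk's live entries).
    chunk = bisect.bisect_right(cumulative, index)
    prior = cumulative[chunk - 1] if chunk else 0
    return chunk, index - prior

print(locate(0))     # (0, 0)
print(locate(1500))  # (1, 500)
print(locate(3000))  # (3, 3)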

abarnert