
For data compression, I want to replace values in a (long) list with an index that says how recently the value last appeared. So the list:

18499   10123   5678   10123   10123   3344   10123   5678   4912   18499

would be replaced as follows:

N18499  N10123  N5678  K1      K0      N3344  K1      K2     N4912  K4

New values that have not been seen before are prefixed with N. Old values that are already known are prefixed with K. The second occurrence of 10123, for example, is replaced with K1, as there is one other value (5678) in between. However, to keep the indexes as low as possible, I want the K-indexes not to measure the distance in the list, but the actual number of unique other values seen between the last and the current occurrence of a value. So, for example, the second occurrence of 5678 is replaced with K2, as there are two other values (10123 and 3344) between them, even though 10123 is repeated a few times in that span. Similarly, the last value, 18499, is replaced with K4, as there are four other values between it and the beginning of the list (which also starts with 18499). If just the distance were measured, the last element would be K9.
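
To pin down these semantics, here is a deliberately naive reference encoder, a minimal sketch of my own (O(n) per element because of the linear scan, which is exactly the cost I want to get rid of):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> input = {18499, 10123, 5678, 10123, 10123,
                                  3344,  10123, 5678, 4912,  18499};
        // Distinct values, ordered from most to least recently seen.
        std::vector<int> recency;

        for (int v : input) {
            // Linear scan: the position of v in the recency list is exactly
            // the number of distinct other values seen since v last occurred.
            std::size_t k = 0;
            while (k < recency.size() && recency[k] != v) ++k;
            if (k == recency.size()) {
                std::printf("N%d ", v);      // first occurrence
            } else {
                std::printf("K%zu ", k);     // k distinct values in between
                recency.erase(recency.begin() + k);
            }
            recency.insert(recency.begin(), v);  // v is now the most recent
        }
        std::printf("\n");  // prints the example encoding shown above
    }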

At first glance, it looks as though the compression/index replacement can be done using an LRU cache, for which Stack Overflow holds some very good references (namely on LRU cache design). Unfortunately, classic LRU caches are not very good for this purpose: while checking whether an item is among the last N items (N being the LRU cache size) is fast at O(1), looking up the actual position of an item in the LRU cache is O(n) (with n the number of elements in front of the found one).

The same slowness holds for the decompression step, when Kn needs to be replaced with the corresponding value again: walking through the linked list of a classic LRU cache to find the item to substitute requires n steps.

I am quite aware that an O(1) solution doesn't exist for my problem, as the counters of how many other items are in front need to be updated each time a new element is added to the cache or an existing one is moved to the front. But is there an O(log(n)) solution? If at all possible, I don't want to go with O(n).

I want to implement this in C++, but pointers to such a data structure in any other programming language would also be appreciated. I am asking more about an algorithm than a specific implementation here.

– Kai Petzke

2 Answers


The best solution that I can think of so far is to store the known values not in a doubly linked list (remember that a classic LRU cache consists of a doubly linked list to maintain element ordering plus a hashmap for fast O(1) element access), but in a tree structure. A suitable tree seems to be the Countertree 2.0 library found on GitHub: https://github.com/fjtapia/countertree_2.0

The Countertree 2.0 library provides the classical map, multimap, set and multiset variants of a tree data structure. But as that library also maintains counters at each tree node, it allows for fast O(log(n)) indexed access and distance calculation between two positions/iterators.

I therefore want to replace the doubly linked list of a standard LRU cache with a countertree::multiset. That multiset will actually be degenerate, as all elements compare equal. New elements are always inserted at the front of the multiset. As the relative ordering of elements in a tree is never changed, all elements retain their relative position with respect to their time of insertion. If a value is revisited while the tree is being built, it is unlinked from its current tree position and reinserted at the front, so each value appears in the multiset only once.

To quickly find already-known values, the positions (= iterators) in the multiset are indexed by value in a separate hashmap (very much like a classic LRU cache combines its linked list with a hashmap). If a value is later found again through that hashmap, its distance to the front of the multiset is measured via the said O(log(n)) distance calculation, and then it is removed from its current position and re-inserted at the front. That re-insertion is another O(log(n)) operation which basically visits the same tree nodes as the distance calculation, so it shouldn't require many additional memory accesses.
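
I haven't coded this against Countertree yet, but the idea can be sketched with GNU libstdc++'s policy-based tree (__gnu_pbds), which likewise keeps per-node counts and gives O(log(n)) order statistics. Note that this is a stand-in, not the Countertree API: instead of re-inserting at the front of a degenerate multiset, each distinct value is keyed by the timestamp of its last occurrence, and the number of larger timestamps is exactly the K index:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>
    #include <ext/pb_ds/assoc_container.hpp>
    #include <ext/pb_ds/tree_policy.hpp>

    using ordered_set = __gnu_pbds::tree<
        std::uint64_t, __gnu_pbds::null_type, std::less<std::uint64_t>,
        __gnu_pbds::rb_tree_tag, __gnu_pbds::tree_order_statistics_node_update>;

    int main() {
        std::vector<int> input = {18499, 10123, 5678, 10123, 10123,
                                  3344,  10123, 5678, 4912,  18499};
        ordered_set recent;                           // last-occurrence timestamps
        std::unordered_map<int, std::uint64_t> last;  // value -> its timestamp
        std::uint64_t now = 0;

        for (int v : input) {
            auto it = last.find(v);
            if (it == last.end()) {
                std::printf("N%d ", v);
            } else {
                // Distinct values seen since v's last occurrence = timestamps
                // larger than v's. order_of_key() counts strictly smaller keys.
                std::size_t k = recent.size() - recent.order_of_key(it->second) - 1;
                std::printf("K%zu ", k);
                recent.erase(it->second);   // "unlink from the current position"
            }
            recent.insert(now);             // "re-insert at the front"
            last[v] = now++;                // every tree operation is O(log(n))
        }
        std::printf("\n");
    }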

During decompression, values are looked up in the multiset by index, which again happens in O(log(n)) time.
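
A matching decompression sketch with the same stand-in tree: find_by_order(i) returns an iterator to the i-th smallest key in O(log(n)), so index size() - 1 - k yields the (k+1)-th most recent timestamp. The decompressor additionally needs a map from timestamp back to value (names are again mine, for illustration):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>
    #include <ext/pb_ds/assoc_container.hpp>
    #include <ext/pb_ds/tree_policy.hpp>

    using ordered_set = __gnu_pbds::tree<
        std::uint64_t, __gnu_pbds::null_type, std::less<std::uint64_t>,
        __gnu_pbds::rb_tree_tag, __gnu_pbds::tree_order_statistics_node_update>;

    struct Token { bool is_new; int value; std::size_t k; };  // "N value" or "K k"

    int main() {
        // The encoded form of the example list from the question.
        std::vector<Token> tokens = {
            {true, 18499, 0}, {true, 10123, 0}, {true, 5678, 0}, {false, 0, 1},
            {false, 0, 0},    {true, 3344, 0},  {false, 0, 1},   {false, 0, 2},
            {true, 4912, 0},  {false, 0, 4}};
        ordered_set recent;
        std::unordered_map<std::uint64_t, int> value_at;  // timestamp -> value
        std::uint64_t now = 0;

        for (const Token& t : tokens) {
            int v;
            if (t.is_new) {
                v = t.value;  // an N token carries the value itself
            } else {
                // A K token selects the (k+1)-th most recent timestamp.
                std::uint64_t ts = *recent.find_by_order(recent.size() - 1 - t.k);
                v = value_at[ts];
                value_at.erase(ts);
                recent.erase(ts);
            }
            recent.insert(now);       // same front re-insertion as when encoding
            value_at[now++] = v;
            std::printf("%d ", v);    // reproduces the original list
        }
        std::printf("\n");
    }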

So everything looks perfect. Or did I overlook something? Or is there an even better solution?

– Kai Petzke

An alternative would be to keep a bitmap of which list positions are the most recent occurrence of some element in the cache, plus a segment tree for efficient popcounts in a range. (A Fenwick tree is also an option, but then you always pay Ω(height) for queries and need a weird nonuniform storage scheme to minimize space.)

The resulting structure has

  • unordered_map from element to (index of most recent occurrence, LRU list iterator)
  • list from most to least recently used
  • segment tree as described above.

To process an element, look it up in the unordered_map. If it exists, the K index is the popcount of the range strictly between the previous occurrence and the current position; emit it and clear the bit in the segment tree for the previous occurrence. Otherwise, create a new list entry and pop the back of the list if needed (clearing its index in the segment tree). In either case, move the current element to the front of the LRU list, copy its index into the unordered_map, and set that index in the segment tree.
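
A sketch of that bookkeeping (all names are mine). It uses the Fenwick-tree option mentioned above because it is the shortest to write down, and it assumes a bitmap covering the whole input (see the space discussion below), in which case nothing is evicted and the LRU list itself isn't needed:

    #include <cstddef>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Fenwick {                        // prefix popcounts over a bit array
        std::vector<int> t;
        explicit Fenwick(std::size_t n) : t(n + 1, 0) {}
        void add(std::size_t i, int d) {    // set (+1) or clear (-1) bit i
            for (++i; i < t.size(); i += i & -i) t[i] += d;
        }
        int prefix(std::size_t i) const {   // number of set bits at positions < i
            int s = 0;
            for (; i > 0; i -= i & -i) s += t[i];
            return s;
        }
    };

    int main() {
        std::vector<int> input = {18499, 10123, 5678, 10123, 10123,
                                  3344,  10123, 5678, 4912,  18499};
        Fenwick bits(input.size());
        std::unordered_map<int, std::size_t> last;  // value -> most recent index

        for (std::size_t i = 0; i < input.size(); ++i) {
            auto it = last.find(input[i]);
            if (it == last.end()) {
                std::printf("N%d ", input[i]);
            } else {
                // K index = set bits strictly between the previous occurrence
                // and here; each set bit marks some distinct value's most
                // recent occurrence inside that gap.
                std::printf("K%d ", bits.prefix(i) - bits.prefix(it->second + 1));
                bits.add(it->second, -1);   // the old occurrence is stale now
            }
            bits.add(i, +1);                // position i is now most recent
            last[input[i]] = i;
        }
        std::printf("\n");
    }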

The unordered_map and list operations are expected amortized O(1), so the segment tree drives the asymptotic cost. This cost is Θ(log n) in theory, but in practice the locality is much better than with a pointer-based tree. We can replace the bottom levels of the segment tree with hardware popcounts on the bitmap words and use a k-ary segment tree instead of a 2-ary one to reduce random memory accesses.

The downside is managing the space usage of the segment tree. Maybe you can afford a bitmap over the whole input, in which case, great! Otherwise, you may need to do a linear pass over the LRU list and the associated data structures to renumber the indexes consecutively while preserving their order. This won't affect the amortized cost, since it only needs to happen once you've processed enough elements to justify it.

– David Eisenstat
  • Thank you very much for your comment! I am a bit worried by the Wikipedia article on segment trees, which lists their general storage requirement as O(n log(n)). It might be that in the application here, where we don't have overlapping segments, memory usage could be lower. I even have a feeling that, in the end, for the linear case, the segment tree breaks down to the same counts as in the countertree. So maybe both our solutions are even practically identical. But I have to think more about it. – Kai Petzke May 02 '21 at 19:28
  • @KaiPetzke yeah, that's the cost of storing n arbitrary segments. This tree isn't for point location; we're just storing sums for each of the O(n) basis segments, which will be O(n) *bits*. – David Eisenstat May 02 '21 at 19:30