How do you efficiently keep track of the top-k keys of a dictionary with the largest values while the dictionary has its keys updated?

I've tried the naive approach of creating a sorted list from the dictionary after every update (as described in Getting key with maximum value in dictionary?), but that is very expensive and does not scale.

Real world example:

Counting the word frequency coming from an infinite stream of data. At any given moment the program may be asked to report whether a word is among the current top-k most frequent words. How do we accomplish this efficiently?

collections.Counter is too slow

>>> from itertools import permutations
>>> from collections import Counter
>>> from timeit import timeit
>>> c = Counter()
>>> for x in permutations(xrange(10), 10):
...     c[x] += 1


>>> timeit('c.most_common(1)', 'from __main__ import c', number=1)
0.7442058258093311
>>> sum(c.values())
3628800

It takes nearly a second to compute this value!

I am looking for O(1) time for a most_common() function. This should be doable by maintaining another data structure that internally stores only the current top-k items and keeps track of the current minimum value.

Wesley Baugh
    This seems like the perfect opportunity to try out `heapq`. – Wei Yen Mar 15 '13 at 07:46
  • @WeiYen I've thought of that, but how do you increment a value that is already in the heap? If you store all the items, you can easily get the top-N most common, but I don't see how you can update the counts efficiently when the values are constantly changing. – Wesley Baugh Mar 15 '13 at 07:48
  • You could try something similar to how they update the priority in the `heapq` docs: http://docs.python.org/2/library/heapq.html#priority-queue-implementation-notes – Wei Yen Mar 15 '13 at 08:10
  • @WeiYen Interesting, I didn't see that in the docs (or your comment as I was writing my answer). That approach is nice and simple, but doesn't work in the case of a constant-size heap, as we want here. On the other hand, a heap of all the unique words would (a) grow enormous, (b) take at least O(N log N) time to find the N largest elements (plus you'd probably have to replace the keys), and (c) contain an _absurd_ amount of "removed" tokens as you build up the full counter (the heap would have one entry for _every word_ you see, not just every unique word type). – Danica Mar 15 '13 at 08:27
  • How many unique words do you expect? Perhaps it's possible to take advantage of the fact that the counts never decrease. Checking whether a word is in the top N might be easier - you just need to find any N items that are larger. – John La Rooy Mar 15 '13 at 08:33
  • @gnibbler I expect to have *at least* several million unique keys, and yes normally never decreasing values. – Wesley Baugh Mar 15 '13 at 08:38

3 Answers


collections.Counter.most_common does a pass over all the values, finding the N-th largest one by putting them in a heap as it goes (in, I think, O(M log N) time, where M is the total number of dictionary elements).
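For reference, here's roughly that pass as a standalone sketch (Python 2, as in the question; the name most_common_sketch is made up):

import heapq

def most_common_sketch(counts, n):
    # One pass over all M items, keeping a size-n heap as it goes:
    # O(M log N). This is essentially what Counter.most_common does.
    return heapq.nlargest(n, counts.iteritems(), key=lambda kv: kv[1])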

heapq, as suggested by Wei Yen in the comments, might work okay: parallel to the dictionary, maintain a heapq of the N largest values, and when you modify the dict check if the value is in there or should be in there now. The problem is that, as you've noted, the interface doesn't really have any way to modify the "priority" (in your case, the [negative, since it's a min-heap] number of counts) of an already-existing element.

You could modify the relevant item in-place and then run heapq.heapify to restore the heapiness. This takes a linear pass in the size of the heap (N) to find the relevant item (unless you're doing additional bookkeeping to associate elements with positions; probably not worth it), and another linear pass to re-heapify. In the case where an element wasn't in the list and now is, you'll need to add it to the heap by replacing the smallest element (in linear time, barring some additional structure).
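For example, a minimal sketch of that modify-then-heapify approach (the function name is made up, and entries are mutable [-count, word] lists, with counts negated as described above):

import heapq

def increment_and_heapify(heap, word):
    # O(N) linear scan to find the word's entry.
    for entry in heap:
        if entry[1] == word:
            entry[0] -= 1  # One more occurrence (counts are negated).
            break
    heapq.heapify(heap)  # Another O(N) pass to restore the invariant.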

The heapq private interface, though, includes a function _siftdown which has this comment:

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.

That sounds good! Calling heapq._siftdown(heap, 0, pos_of_relevant_idx) will fix the heap in log N time. Of course, you have to find the position of the index you're incrementing first, which takes linear time. You could potentially maintain a dictionary of elements to indices to avoid that (also keeping a pointer to the position of the smallest element), but then you'd either have to copy out the source of _siftdown and modify it to update the dictionary when it swaps things, or do a linear time pass to rebuild the dictionary afterwards (but you were just trying to avoid linear passes...).
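Here's a sketch of that linear-search-plus-_siftdown variant (same made-up names and mutable [-count, word] entries as above; _siftdown is a private CPython helper, so this leans on an implementation detail):

import heapq

def increment(heap, word):
    for pos, entry in enumerate(heap):  # O(N) to locate the entry.
        if entry[1] == word:
            entry[0] -= 1  # The negated count only ever decreases,
            # so the entry can only need to move toward the root,
            # which is exactly what _siftdown repairs in O(log N).
            heapq._siftdown(heap, 0, pos)
            return
    raise KeyError(word)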

Being careful, this should work out to O(log N) time. It turns out, though, there's something called a Fibonacci heap that does support all the operations you need, in (amortized) constant time. Unfortunately, this is one of those cases where big-O isn't the whole story; the implementation complexity of Fibonacci heaps means that in practice, except maybe for very large heaps, they're not actually faster than binary heaps. Also (perhaps "therefore"), there's no standard Python implementation that I found in a quick googling, although the Boost C++ libraries do include one.

I'd first try using heapq, doing the linear search for the element you're changing, and calling _siftdown; this is O(N) time, as compared to O(M log N) for the Counter approach. If that turns out to be too slow, you could maintain the additional dictionary of indices and make your own version of _siftdown that updates the dict, which should end up O(log N) time. If this is still too slow (which I doubt), you could look for a Python wrapper to Boost's Fibonacci heap (or another implementation), but I really doubt this is going to be worth the hassle.

Danica

Use collections.Counter - it already does that for your real-world example. Do you have other use cases?
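For example (the list here just stands in for the stream):

>>> from collections import Counter
>>> counter = Counter()
>>> for word in ['the', 'cat', 'sat', 'on', 'the', 'mat']:
...     counter[word] += 1

>>> counter.most_common(1)
[('the', 2)]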

John La Rooy

We can implement a class that keeps track of the top-k values, as I don't believe the standard library has this built-in. This would be kept up to date in parallel with the main dictionary object (probably a Counter). You could also use this as an attribute of a subclass of the main dictionary object.

Implementation

from collections import defaultdict


class MostCommon(object):
    """Keep track of the top-k key-value pairs.

    Attributes:
        top: Integer representing the top-k items to keep track of.
        store: Dictionary of the top-k items.
        min: The current minimum value of any top-k item.
        min_set: Dictionary mapping each count to the set of keys
            with that count.
    """
    def __init__(self, top):
        """Create a new MostCommon object to track key-value pairs.

        Args:
            top: Integer representing the top-k values to keep track of.
        """
        self.top = top
        self.store = dict()
        self.min = None
        self.min_set = defaultdict(set)

    def _update_existing(self, key, value):
        """Update an item that is already one of the top-k values."""
        # Currently we only handle values that strictly increase.
        assert value > self.store[key]
        old_value = self.store[key]
        # Move the key from its old count bucket to the new one.
        self.min_set[old_value].remove(key)
        if not self.min_set[old_value]:  # Drop empty buckets so that
            del self.min_set[old_value]  # min() never sees stale counts.
        self.min_set[value].add(key)
        self.store[key] = value
        if old_value == self.min:  # Previously was the minimum.
            self.min = min(self.min_set.keys())

    def __contains__(self, key):
        """Boolean if the key is one of the top-k items."""
        return key in self.store

    def __setitem__(self, key, value):
        """Assign a value to a key.

        The item won't be stored if it is less than the minimum (and
        the store is already full). If the item is already in the store,
        the value will be updated along with the `min` if necessary.
        """
        # Store it if we aren't full yet.
        if len(self.store) < self.top:
            if key in self.store:  # We already have this item.
                self._update_existing(key, value)
            else:  # Brand new item.
                self.store[key] = value
                self.min_set[value].add(key)
                if self.min is None or value < self.min:
                    self.min = value
        else:  # We're full. The value must exceed the minimum to be added.
            if value > self.min:  # New item must be larger than current min.
                if key in self.store:  # We already have this item.
                    self._update_existing(key, value)
                else:  # Brand new item.
                    # Make room by removing one of the current minimums.
                    old = self.min_set[self.min].pop()
                    del self.store[old]
                    # Delete the set if there are no old minimums left.
                    if not self.min_set[self.min]:
                        del self.min_set[self.min]
                    # Add the new item.
                    self.min_set[value].add(key)
                    self.store[key] = value
                    self.min = min(self.min_set.keys())

    def __repr__(self):
        if len(self.store) < 10:
            store = repr(self.store)
        else:
            length = len(self.store)
            largest = max(self.store.itervalues())
            store = '<len={length}, max={largest}>'.format(length=length,
                                                           largest=largest)
        return ('{self.__class__.__name__}(top={self.top}, min={self.min}, '
                'store={store})'.format(self=self, store=store))

Example usage

>>> common = MostCommon(2)
>>> common
MostCommon(top=2, min=None, store={})
>>> common['a'] = 1
>>> common
MostCommon(top=2, min=1, store={'a': 1})
>>> 'a' in common
True
>>> common['b'] = 2
>>> common
MostCommon(top=2, min=1, store={'a': 1, 'b': 2})
>>> common['c'] = 3
>>> common
MostCommon(top=2, min=2, store={'c': 3, 'b': 2})
>>> 'a' in common
False
>>> common['b'] = 4
>>> common
MostCommon(top=2, min=3, store={'c': 3, 'b': 4})

Access after updating values is indeed O(1)

>>> counter = Counter()
>>> for x in permutations(xrange(10), 10):
...     counter[x] += 1

>>> common = MostCommon(1)
>>> for key, value in counter.iteritems():
...     common[key] = value

>>> common
MostCommon(top=1, min=1, store={(9, 7, 8, 0, 2, 6, 5, 4, 3, 1): 1})
>>> timeit('repr(common)', 'from __main__ import common', number=1)
1.3251570635475218e-05

Access is O(1), but when the minimum changes during a set-item call it is an O(n) operation, where n is the number of top values being tracked. This is still better than Counter, which is O(n) for every most_common access, where n is the size of the entire vocabulary!

Wesley Baugh