Ideal data structure with fast lookup, fast update and easy comparison/sorting

Question

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:

Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
If it is, update the timestamp to "now"
If not, add it to the set with timestamp "now"

Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime object, or some other easy-to-compare hash/string).

I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.

Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).

Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.

score 4 · Accepted Answer · answered Oct 09 '13 at 20:12

This is a pretty well trod space. The thing you need is two structures, You need something to tell you wether your key (hash in your case) is known to the collection. For this, dict is a very good fit; we'll just map the hash to the timestamp so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to Heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap, as a tuple of (timestamp, hash).

Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by just ignoring entries in the heap that have timestamps that are dissimilar from the value in the dict.

So here's a place to start, you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:

import heapq


class ExpiringCache(object):
    def __init__(self):
        self._dict = {}
        self._heap = []

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()

create a cache and add some items

>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)

re-add an item with an even later expiry

>>> xc.add('apples', 4)

collect everything "older" than two time units

>>> xc.collect(2)    
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False

Nice use of the heapq - I was thinking of using it, but couldn't think of a smart way to update expiry stamps - your solution definitely works out well. Only thing I am changing is that `collect()` needs to return a list of the expired keys (ignoring those with updated timestamps) as I need to perform some post-expiry operations. Thanks! — Christian P., Oct 10 '13 at 08:12
I was trying to understand how the heapq remained sorted correctly when using a tuple as the item. I discovered that using a tuple as the item is handled by heapq as a special case, as mentioned in the docs: http://docs.python.org/2/library/heapq.html (Heap elements can be tuples. This is useful for assigning comparison values (such as task priorities) alongside the main record being tracked). — Mitch ミッチ, Dec 02 '13 at 12:32
@OniMitch: `tuple`'s are not a special case at all; tuple's define sort operations in [lexicographic](http://en.wikipedia.org/wiki/Lexicographical_order) order, and `heapq` uses the natural sort order of whatever is in the list it's asked to heapify. You could use anything at all that, when sorted, would produce elements in the correct order (heapq will just avoid doing swaps except at the list head) — SingleNegationElimination, Dec 02 '13 at 21:45
Inspiring solution. I met the same question in my work. I was thinking whether using two data structures (heap and dict) is the best, since this sounds takes more memory. Is there a way to only use the heap to do both jobs: check if a key is known to collection and iterating over the items in order of timestamp? — enaJ, Jul 08 '16 at 20:50
I guess my question above is a a trade-off between memory and time complexcity. To do check if a hash exsits, a dictionary takes O(1) and heap takes O(n). To delete or insert, a dictionary takes O(n) and heap takes O(log(n)). If the incoming stream data is not big, which can be well fit in memory, using two data structures is not a big deal, as such we prioritize the time complexcity. — enaJ, Jul 08 '16 at 21:00

Useless · Answer 2 · 2013-10-09T16:56:27.467

The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).

By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.

Figuring out the details (and implementing them) is left as an exercise for the reader ...

Caveats:

worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp

A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.

Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.

John Y · Answer 3 · 2013-10-09T20:18:38.037

Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.

This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.

Edit:

For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.

It might be useful, but I it is not guaranteed that entries arrive in timestamp-order (I may get entries that are delayed and need to not be put in at the end). Otherwise a good suggestion! — Christian P., Oct 10 '13 at 11:59
@ChristianP.: Hmm... the way you describe your situation (if an incoming hash exists, update its timestamp to "now"; if the hash doesn't exist, insert it with timestamp of "now") then the only way timestamps will "arrive" out of order is if you have previously existing data (initial state) which is out of order. Then can't you just sort that data by timestamp first, before accepting new data? There's an initial cost, yes, but smooth sailing after that. — John Y, Oct 10 '13 at 13:09

Cameron Sparr · Answer 4 · 2013-10-09T16:14:31.150

1

If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)

http://en.wikipedia.org/wiki/Bloom_filter

and a python implementation: https://github.com/axiak/pybloomfiltermmap

EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.

edited Oct 09 '13 at 16:14

answered Oct 09 '13 at 15:48

Cameron Sparr

3,925
2
22
31

1

It's not clear from the question that the hash is based on the timestamp - it sounds like two independent elements to me. – Useless Oct 09 '13 at 16:27
@Useless: Actually, I would say from the question that it's pretty clear that the hash is **not** based on the timestamp. (I'm agreeing with you, but much stronger.) – John Y Oct 09 '13 at 16:32
You are correct, I guess I should have just called it a string as in this context it does not matter that it is a hash - the timestamp and hash/string are paired, but the hash/string is the identifier. – Christian P. Oct 10 '13 at 08:09

AShelly · Answer 5 · 2013-10-10T19:26:52.263

A simple hashtable or dictionary will be O(1) for the check/update/set operations. You could simultaneously store the data in a simple time-ordered list where for the purge operations. Keep a head and tail pointer, so that insert is also O(1) and removal is as simple as advancing the head until it reaches the target time and removing all the entries you find from the hash.

The overhead is one extra pointer per stored data item, and the code is dead simple:

insert(key,time,data):
  existing = MyDictionary.find(key)
  if existing:  
      existing.mark()
  node = MyNodeType(data,time)  #simple container holding args + 'next' pointer
  node.next = NULL
  MyDictionary.insert(key,node)
  Tail.next = node
  if Head is NULL:  Head = node

clean(olderThan):
  while Head.time < olderThan:
    n = Head.next 
    if not Head.isMarked():  
        MyDictionary.remove(Head.key)
    #else it was already overwritten
    if Head == Tail: Tail = n
    Head = n

With a singly-linked list, how do you handle the update-in-place requirement? — Useless, Oct 09 '13 at 18:56
Actually it changes very slightly. The list is in time order. On a new timestamp for the same key, the new node will replace the old one in the dictionary. The old node remains in the list, where it will be cleaned up when it expires. Flag the old ones when replacing so they are not cleaned from the hash. — AShelly, Oct 10 '13 at 19:24

Ideal data structure with fast lookup, fast update and easy comparison/sorting

5 Answers5

Linked