64

I'd like to work with a dict in python, but limit the number of key/value pairs to X. In other words, if the dict is currently storing X key/value pairs and I perform an insertion, I would like one of the existing pairs to be dropped. It would be nice if it was the least recently inserted/accesses key but that's not completely necessary.

If this exists in the standard library please save me some time and point it out!

martineau
  • 119,623
  • 25
  • 170
  • 301
anthony
  • 40,424
  • 5
  • 55
  • 128
  • 1
    duplicate: http://stackoverflow.com/questions/1756992/removing-the-oldest-element-from-a-dictionary-in-python – Nick Dandoulakis Mar 13 '10 at 07:27
  • good find. i'd like to keep this around though since i don't specifically need lru. – anthony Mar 13 '10 at 07:34
  • @Nick: Limiting the size seems enough of a distinction to be a different question, but yes, that question is half of this. –  Mar 13 '10 at 07:34

8 Answers8

58

Python 2.7 and 3.1 have OrderedDict and there are pure-Python implementations for earlier Pythons.

from collections import OrderedDict

class LimitedSizeDict(OrderedDict):
    def __init__(self, *args, **kwds):
        self.size_limit = kwds.pop("size_limit", None)
        OrderedDict.__init__(self, *args, **kwds)
        self._check_size_limit()

    def __setitem__(self, key, value):
        OrderedDict.__setitem__(self, key, value)
        self._check_size_limit()

    def _check_size_limit(self):
        if self.size_limit is not None:
            while len(self) > self.size_limit:
                self.popitem(last=False)

You would also have to override other methods that can insert items, such as update. The primary use of OrderedDict is so you can control what gets popped easily, otherwise a normal dict would work.

martineau
  • 119,623
  • 25
  • 170
  • 301
  • @Mike: `(self, size_limit=None, *args, **kwds)` is wrong and keyword-only parameters (`(self, *args, size_limit=None, **kwds)`) aren't in current 2.x. (Are they being considered for 2.7? Regardless they aren't 2.6.) This code works with just about any version the OP is likely to use, makes *size\_limit* effectively a keyword-only parameter, and maintains the same interface as dicts. –  Mar 13 '10 at 08:28
  • 1
    Have you tested your code on Python 2.7? `dict.pop` requires at least 1 argument. `dict.popitem()` works, but it removed the recent item. – Sridhar Ratnakumar Oct 22 '10 at 23:29
  • 1
    Also, __setitem__ should check for size limit *before* actually setting the item, lest you end up losing the very item your are setting! – Sridhar Ratnakumar Oct 22 '10 at 23:34
  • @Sridhar: Looking at it now, months later, I'm not sure what I was thinking; but popitem makes more sense. –  Oct 22 '10 at 23:34
  • @Sridhar: You need to check after setting the item, as you might be replacing an existing item. –  Oct 22 '10 at 23:39
  • 5
    To be accurate, this is not really an LRU implementation. This is FIFO with deletion due to dict size limitation. In order to have a full LRU implementation, one needs to override `__contains__` method to move the last item "used" or queried to the top of the dict linked list. [I understand though, this is not the main objective of the question] – Amir Nov 08 '16 at 03:38
  • 4
    I think this answer is way too highly rated since it doesn't show how to implement any of the other methods likely also to be needed—which would need to include not just those which add/insert elements as the author mentioned, but also any that delete/remove them. – martineau Dec 28 '18 at 19:08
23

cachetools will provide you nice implementation of Mapping Hashes that does this (and it works on python 2 and 3).

Excerpt of the documentation:

For the purpose of this module, a cache is a mutable mapping of a fixed maximum size. When the cache is full, i.e. by adding another item the cache would exceed its maximum size, the cache must choose which item(s) to discard based on a suitable cache algorithm.

vaab
  • 9,685
  • 7
  • 55
  • 60
15

Here's a simple, no-LRU Python 2.6+ solution (in older Pythons you could do something similar with UserDict.DictMixin, but in 2.6 and better that's not recommended, and the ABCs from collections are preferable anyway...):

import collections

class MyDict(collections.MutableMapping):
    def __init__(self, maxlen, *a, **k):
        self.maxlen = maxlen
        self.d = dict(*a, **k)
        while len(self) > maxlen:
            self.popitem()
    def __iter__(self):
        return iter(self.d)
    def __len__(self):
        return len(self.d)
    def __getitem__(self, k):
        return self.d[k]
    def __delitem__(self, k):
        del self.d[k]
    def __setitem__(self, k, v):
        if k not in self and len(self) == self.maxlen:
            self.popitem()
        self.d[k] = v

d = MyDict(5)
for i in range(10):
    d[i] = i
    print(sorted(d))

As other answers mentioned, you probably don't want to subclass dict -- the explicit delegation to self.d is unfortunately boilerplatey but it does guarantee that every other method is properly supplied by collections.MutableMapping.

martineau
  • 119,623
  • 25
  • 170
  • 301
Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
9

Here is a simple and efficient LRU cache written with dirt simple Python code that runs on any python version 1.5.2 or later:

class LRU_Cache:

    def __init__(self, original_function, maxsize=1000):
        self.original_function = original_function
        self.maxsize = maxsize
        self.mapping = {}

        PREV, NEXT, KEY, VALUE = 0, 1, 2, 3         # link fields
        self.head = [None, None, None, None]        # oldest
        self.tail = [self.head, None, None, None]   # newest
        self.head[NEXT] = self.tail

    def __call__(self, *key):
        PREV, NEXT = 0, 1
        mapping, head, tail = self.mapping, self.head, self.tail

        link = mapping.get(key, head)
        if link is head:
            value = self.original_function(*key)
            if len(mapping) >= self.maxsize:
                old_prev, old_next, old_key, old_value = head[NEXT]
                head[NEXT] = old_next
                old_next[PREV] = head
                del mapping[old_key]
            last = tail[PREV]
            link = [last, tail, key, value]
            mapping[key] = last[NEXT] = tail[PREV] = link
        else:
            link_prev, link_next, key, value = link
            link_prev[NEXT] = link_next
            link_next[PREV] = link_prev
            last = tail[PREV]
            last[NEXT] = tail[PREV] = link
            link[PREV] = last
            link[NEXT] = tail
        return value

if __name__ == '__main__':
    p = LRU_Cache(pow, maxsize=3)
    for i in [1,2,3,4,5,3,1,5,1,1]:
        print(i, p(i, 2))
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • 3
    wow, you write pretty much LRU in such a short period for various user cases, it is always a joy to read your Python code from whether python cookbook(activestate), python standard lib, blog, twitter, or pycon lectures, and now on stackoverflow too. – sunqiang Dec 01 '11 at 02:50
  • This implementation looks non-pythonic. Why not use `OrderedDict` and `MutableMapping` to do this... – Ian Chen Jun 14 '17 at 15:14
  • 4
    It's perfectly Pythonic -- a simple class with straight-forward use of dictionaries, lists, basic assignments an unpacking. The logic is self-contained and there are no external dependencies. This code is also very fast and is easily optimized further by PyPy. An OrderedDict adds space overhead (it uses two dicts internally while this only uses one) and it does unnecessary work that isn't needed for this use. MutableMapping doesn't provide any useful capabilities in this context. – Raymond Hettinger Jun 14 '17 at 16:00
  • What's the point of the class-level definitions of `PREV`, `NEXT`, `KEY`, and `VALUE` since they're not used in any of the methods? – martineau Dec 11 '18 at 16:52
  • 1
    @martineau It is there to just show the structure of the link. – Raymond Hettinger Dec 11 '18 at 17:36
  • this doesn compile self.head[NEXT] = self.tail NameError: name 'NEXT' is not defined – ealeon Jul 27 '21 at 12:27
  • @ealeon Thanks for noticing. I just rolled back to the previous version where NEXT is defined. – Raymond Hettinger Jul 28 '21 at 20:49
5

There have been many good answers, but I want to point out a simple, pythonic implementation for LRU cache. It's similar to Alex Martelli's answer.

from collections import OrderedDict, MutableMapping

class Cache(MutableMapping):
    def __init__(self, maxlen, items=None):
        self._maxlen = maxlen
        self.d = OrderedDict()
        if items:
            for k, v in items:
                self[k] = v

    @property
    def maxlen(self):
        return self._maxlen

    def __getitem__(self, key):
        self.d.move_to_end(key)
        return self.d[key]

    def __setitem__(self, key, value):
        if key in self.d:
            self.d.move_to_end(key)
        elif len(self.d) == self.maxlen:
            self.d.popitem(last=False)
        self.d[key] = value

    def __delitem__(self, key):
        del self.d[key]

    def __iter__(self):
        return self.d.__iter__()

    def __len__(self):
        return len(self.d)
Ian Chen
  • 325
  • 3
  • 11
2

You can create a custom dictionary class by subclassing dict. In your case, you would have to override __setitem__ to have check your own length and delete something if the limit is recahed. The following example would print the current lenght after every insertion:

class mydict(dict):
    def __setitem__(self, k, v):
        dict.__setitem__(self, k, v)
        print len(self)

d = mydict()
d['foo'] = 'bar'
d['bar'] = 'baz'
c089
  • 4,951
  • 2
  • 25
  • 37
  • 2
    Subclassing builtins like `dict` is often fruitless. In normal circumstances with subclassing, methods like `update` and `setdefault` would rely on the overrided `__getitem__`, but this is not how it works here. Subclassing builtins makes it easy to introduce difficult-to-see bugs. When you eliminate all such bugs, you really haven't saved any work by subclassing. – Mike Graham Mar 13 '10 at 07:32
2

A dict does not have this behavior. You could make your own class that does this, for example something like

class MaxSizeDict(object):
    def __init__(self, max_size):
        self.max_size = max_size
        self.dict = {}
    def __setitem__(self, key, value):
        if key in self.dict:
            self.dict[key] = value    
            return

        if len(self.dict) >= self.max_size:
      ...

A few notes about this

  • It would be tempting for some to subclass dict here. You can technically do this, but it is bug-prone because the methods do not depend on each other. You can use UserDict.DictMixin to save having to define all methods. There are few methods you would be able re-use if you subclass dict.
  • A dict does not know what the least recently added key is, since dicts are unordered.
    • 2.7 will introduce collections.OrderedDict, but for now keeping the keys in order separately should work fine (use a collections.deque as a queue).
    • If getting the oldest isn't all that imporant, you can just use the popitem method to delete one arbitrary item.
  • I interprettered oldest to mean first insertion, approximately. You would have to do something a bit different to eliminate the LRU items. The most obvious efficient strategy would involve keeping a doubly-linked list of keys with references to the nodes themselves stored as dict values (along with the real values). This gets more complicated and implementing it in pure Python carries a lot of overhead.
Mike Graham
  • 73,987
  • 14
  • 101
  • 130
  • You actually need something much more complex than a dict+deque if you want O(1) get-/set-/delitem, though of course this only really matters for larger sizes. –  Mar 13 '10 at 07:44
  • You can directly reuse about half of the methods from the dict, OrderedDict, or other base even without using DictMixin. It really doesn't seem that bugprone to write forwarding methods for the others, certainly no more bugprone than having to write them all yourself, as you have here. –  Mar 13 '10 at 07:48
  • I am trying to get at what hopefully solves the problem at hand. dict+deque gives you O(1) get, set, and delitem if you only care a little about the orderedness of the keys and you don't fall into the pathological case where there are many deletions and re-setting of the same keys, leading to the deque being polluted by keys not in the dict. – Mike Graham Mar 13 '10 at 07:50
  • No, for delitem you have to search the deque for the deleted item, and that's O(n). Check out one of the pure-Python OrderedDict implementations. –  Mar 13 '10 at 07:52
  • The majority of the times I've seen `dict` subclassed in the wild, the coder forgot to override all the appropriate methods. This means that errors stemming from this have the opportunity to pass silently, which is really bad. I really don't see much to be gained from subclassing in a case like this. – Mike Graham Mar 13 '10 at 07:56
  • So think about what you're doing and write some tests, same as when writing anything. Sloppy code will always have bugs. –  Mar 13 '10 at 07:57
  • If you wanted the current keys stored, you'd have to search the deque or list or whatever. However, if you leave the item there and when you pop it verify whether it is in the dict, you save the search. Whether this scales or not depends on how the instances are being used. – Mike Graham Mar 13 '10 at 07:59
  • Like I say, I don't see much benefit to subclassing here to start with. I do think that it ends up introducing a danger—things that look like they *should* work and that *would* work when subclassing a Python class will not work. I would need to see some real benefit to want to introduce something like that to code I was maintaining. – Mike Graham Mar 13 '10 at 08:01
  • Assuming update calls getitem without knowing or checking is sloppy any way you slice it, and that assumption is the only way to "look like [it] should work". You never know if you have missing functionality or not *unless you test or otherwise prove everything you want to guarantee*. -- Again, look at one of the decent pure-Python OrderedDict implementations, which do have O(1) delitem. –  Mar 13 '10 at 08:21
  • Deleting the keys from the left of the deque and sacrificing some correctness, I am not necessarily performing an O(n) operation to delete the items being eliminated. – Mike Graham Mar 13 '10 at 08:25
  • _"2.7 will introduce collections.OrderedDict"_ Python 3.1 already has OrderedDict. time for a change ? – Adrien Plisson Mar 13 '10 at 08:26
  • Very, very few people are using 3.x for writing serious code currently because of the lack of support by important libraries like numpy, PIL, twisted, and most of the web frameworks and because of the cost of porting code bases. (Especially the former reason.) – Mike Graham Mar 13 '10 at 08:29
  • @Mike: When I say delitem, I mean `__delitem__`, as in `del d[key]`. Note this O(n) search really does matter if he wants to use LRU instead of LR-added. –  Mar 13 '10 at 08:35
  • @Adrien: 3.x is a split from 2.x and development is happening in parallel with 2.x, rather than 3.x being "later". It's correct to say 2.7 and 3.1 both introduce it to their respective lines. –  Mar 13 '10 at 08:38
  • @Roger, I haven't defined what I need to do for that operation, if OP needs to support it at all. You do not have to do the O(n) search/delete on the deque/list if you don't *need* to maintain a container of the current keys like you must in an ordered dict type. – Mike Graham Mar 13 '10 at 08:42
  • @Roger: i am perfectly aware of this fact. i am just trying to convince the most people to switch to python 3. – Adrien Plisson Mar 13 '10 at 09:10
  • @Mike: ...and because libraries are lacking python 3 support, people do not use it for serious development, and we are back at the starting point. if we continue this way, people will still be using python 2 in 20 years (like people are still using C89 nowadays). – Adrien Plisson Mar 13 '10 at 09:13
  • The (not-that-scary) threat of people sticking with Python 2 in the long term does not make switching more feasible for people that need an efficient numerical math library, a web framework, a great asynchronous networking library, or a way to manipulate images. – Mike Graham Mar 13 '10 at 16:48
0

There is a library called CircularDict that implements this behaviour. It allows to limit the maximum amount of items the dict can store, but also to set memory usage limits.

It can be installed with:

pip install circular-dict

And used this way:

from circular_dict import CircularDict

# Initialize a CircularDict with a maximum length of 3
my_dict = CircularDict(maxlen=3) # You could also set maxsize_bytes=8*1024 bytes

# Fill it with 4 items
my_dict['item1'] = 'value1'
my_dict['item2'] = 'value2'
my_dict['item3'] = 'value3'
# When adding this 4th item, the 1st one will be dropped
my_dict['item4'] = 'value4'
print(circ_dict)

Ouptut will look like.

{'item2': 'value2', 'item3': 'value3', 'item4': 'value4'}
Haru Kaeru
  • 41
  • 3