31

UPDATED based on Lennart Regebro's answer

Suppose you iterate through a dictionary, and sometimes need to delete an element. The following is very efficient:

remove = []
for k, v in dict_.items():
  if condition(k, v):
    remove.append(k)
    continue
  # do other things you need to do in this loop
for k in remove:
  del dict_[k]

The only overhead here is building the list of keys to remove; unless it grows large compared to the dictionary size, it's not an issue. However, this approach requires some extra coding, so it's not very popular.

The popular dict comprehension approach:

dict_ = {k : v for k, v in dict_.items() if not condition(k, v)}
for k, v in dict_.items():
  # do other things you need to do in this loop

results in a full dictionary copy, and so has the risk of a silly performance hit if dictionaries grow large or the containing function is called often.

A much better approach is to copy the keys only rather than whole dictionary:

for k in list(dict_.keys()):
  if condition(k, dict_[k]):
    del dict_[k]
    continue
  # do other things you need to do in this loop       

(Note that all code examples here are in Python 3, so keys() and items() return views, not copies.)

In most cases, it won't hurt performance that much, since the time to check even the simplest condition (not to mention other stuff you're doing in the loop) is usually greater than the time to add one key to a list.

Still, I am wondering if it's possible to avoid even that with a custom dictionary that allows deletions while iterating:

for k, v in dict_.items():
  if condition(k, v):
    del dict_[k]
    continue
  # do other things you need to do in this loop

Perhaps an iterator could always look ahead, so that when __next__ is called, the iterator already knows where to go without looking at the current element (it would only need to look at an element when it first reaches it). And if there is no next element, the iterator could simply set a flag that causes a StopIteration exception to be raised whenever __next__ is called again.

If the element the iterator tries to advance to turns out to be deleted, it's fine to raise an exception; there is no need to support deletions while multiple iterations are going on simultaneously.
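To make the idea concrete, here is a rough sketch (a hypothetical LookaheadDict, not anything in the stdlib): the keys are chained into a private singly linked list, and iteration reads the next link before yielding the current key, so deleting the current key cannot break the walk. Stale links for deleted keys are never reclaimed, which is exactly the kind of material overhead worried about below.

```python
class LookaheadDict(dict):
    """Hypothetical sketch: __iter__ walks a private singly linked list
    of keys and fetches the *next* link before yielding the current key,
    so deleting the current key during iteration is safe."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._next = {}        # key -> following key (None = end)
        self._first = None
        self._last = None
        for k in dict.keys(self):
            self._link(k)

    def _link(self, key):
        if self._first is None:
            self._first = key
        else:
            self._next[self._last] = key
        self._next[key] = None
        self._last = key

    def __setitem__(self, key, value):
        if key not in self:
            self._link(key)
        dict.__setitem__(self, key, value)

    def __iter__(self):
        nxt = self._first
        while nxt is not None:
            # look ahead first; the caller may delete the current key
            cur, nxt = nxt, self._next[nxt]
            if dict.__contains__(self, cur):   # skip stale (deleted) links
                yield cur


d = LookaheadDict({0: 'a', 1: 'b', 2: 'c', 3: 'd'})
for k in d:            # deleting during iteration does not break it
    if k % 2 == 0:
        del d[k]
print(sorted(d))       # -> [1, 3]
```

Only `for k in d` (i.e. __iter__) is protected in this sketch; keys()/items() views are untouched and would still blow up, and the stale link entries leak until the dict is rebuilt.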

Are there any problems with this approach?

One problem is that I'm not sure it can be done with no material overhead compared to the existing dict; otherwise, it would be faster to use the list(dict_) approach!

UPDATE:

I tried all the versions. I don't report the timings, since they clearly depend heavily on the exact situation. But it seems safe to say that in many cases, the fastest approach is likely to be list(dict_). After all, if you think about it, the copy is the fastest operation that grows linearly with the size of the dictionary; almost any other overhead, as long as it's also proportional to the size, is likely to be bigger.
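For reference, a harness along the lines of that comparison (condition() here is a hypothetical cheap predicate; no numbers quoted, since they depend on dict size, predicate cost, and the fraction of keys removed):

```python
import timeit

def make():
    return {k: k for k in range(10_000)}

def condition(k, v):
    return v % 7 == 0      # hypothetical predicate

def via_list_copy():
    # iterate a list copy of the keys, delete in place
    d = make()
    for k in list(d):
        if condition(k, d[k]):
            del d[k]
    return d

def via_deferred_list():
    # collect keys first, delete afterwards
    d = make()
    remove = [k for k, v in d.items() if condition(k, v)]
    for k in remove:
        del d[k]
    return d

def via_comprehension():
    # rebuild the whole dict
    d = make()
    return {k: v for k, v in d.items() if not condition(k, v)}

for fn in (via_list_copy, via_deferred_list, via_comprehension):
    print(fn.__name__, timeit.timeit(fn, number=50))
```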

I really like all the ideas, but since I have to select only one, I'm accepting the context manager solution, since it allows using the dictionary as either normal or "enhanced" with very small code changes.

max

8 Answers

17

As you note, you can store the items to delete somewhere and defer the deletion of them until later. The problem then becomes when to purge them and how to make sure that the purge method eventually gets called. The answer to this is a context manager which is also a subclass of dict.

class dd_dict(dict):    # the dd is for "deferred delete"
    _deletes = None
    def __delitem__(self, key):
        if key not in self:
            raise KeyError(str(key))
        if self._deletes is None:
            dict.__delitem__(self, key)
        else:
            self._deletes.add(key)
    def __enter__(self):
        self._deletes = set()
    def __exit__(self, type, value, tb):
        for key in self._deletes:
            try:
                dict.__delitem__(self, key)
            except KeyError:
                pass
        self._deletes = None

Usage:

# make the dict and do whatever to it
ddd = dd_dict(a=1, b=2, c=3)

# now iterate over it, deferring deletes
with ddd:
    for k, v in ddd.items():
        if k == "a":
            del ddd[k]
            print(ddd)    # shows that "a" is still there

print(ddd)                # shows that "a" has been deleted

If you're not in a with block, of course, deletes are immediate; as this is a dict subclass, it works just like a regular dict outside of a context manager.

You could also implement this as a wrapper class for a dictionary:

class deferring_delete(object):
    def __init__(self, d):
        self._dict = d
    def __enter__(self):
        self._deletes = set()
        return self
    def __exit__(self, type, value, tb):
        for key in self._deletes:
            try:
                del self._dict[key]
            except KeyError:
                pass
        del self._deletes
    def __delitem__(self, key):
        if key not in self._dict:
            raise KeyError(str(key))
        self._deletes.add(key)

d = dict(a=1, b=2, c=3)

with deferring_delete(d) as dd:
    for k, v in d.items():
        if k == "a":
            del dd[k]    # delete through wrapper

print(d)

It's even possible to make the wrapper class fully functional as a dictionary, if you want, though that's a fair bit more code.
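One possible shape for that fully functional wrapper (a sketch, not the answer's tested code): inheriting from collections.abc.MutableMapping means only the five abstract methods need forwarding, and the ABC supplies get(), items(), pop(), __contains__(), and the rest for free.

```python
from collections.abc import MutableMapping

class deferring_dict(MutableMapping):
    """Hypothetical sketch of a fully dict-like wrapper with
    deferred deletes inside a with block."""

    def __init__(self, d):
        self._dict = d
        self._deletes = None      # None means "not inside a with block"

    # the five abstract methods; MutableMapping derives the rest
    def __getitem__(self, key):
        return self._dict[key]

    def __setitem__(self, key, value):
        self._dict[key] = value

    def __delitem__(self, key):
        if self._deletes is None:
            del self._dict[key]       # immediate outside a with block
        elif key not in self._dict:
            raise KeyError(key)
        else:
            self._deletes.add(key)    # deferred inside a with block

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    def __enter__(self):
        self._deletes = set()
        return self

    def __exit__(self, type, value, tb):
        for key in self._deletes:
            self._dict.pop(key, None)
        self._deletes = None


d = dict(a=1, b=2, c=3)
with deferring_dict(d) as dd:
    for k, v in dd.items():      # safe: deletes are deferred
        if k == "a":
            del dd[k]
print(d)                         # -> {'b': 2, 'c': 3}
```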

Performance-wise, this is admittedly not such a win, but I like it from a programmer-friendliness standpoint. The second method should be very slightly faster since it's not testing a flag on each delete.

kindall
  • Thanks. It works, and is a great example for me to work through. Unfortunately, it's slower than making a copy of the keys in my application; presumably Python overhead when doing things like this is just too big. – max Feb 06 '12 at 09:10
  • Updated this a bit with some error handling and simplifications. – kindall Sep 19 '12 at 16:30
  • Exceedingly clever use of context management. Robust, too. I concur with both [max](https://stackoverflow.com/users/336527/max) and [Lennart Regebro](https://stackoverflow.com/users/126214/lennart-regebro), however: the [`list(dict_)` approach](https://stackoverflow.com/a/9023812/2809027) is sufficiently simple _and_ efficient that there's not much incentive to pursue complex alternatives. Nonetheless, **mandatory upvotes for Pythonic black magic.** – Cecil Curry Feb 22 '16 at 06:56
8

What you need to do is avoid modifying the list of keys you are iterating over. You can do this in three ways:

  1. Make a copy of the keys in a separate list and iterate over that. You can then safely delete the keys in the dictionary during iteration. This is the easiest and fastest approach, unless the dictionary is so huge that you should be thinking about using a database in any case. Code:

    for k in list(dict_):
      if condition(k, dict_[k]):
        del dict_[k]
        continue
      # do other things you need to do in this loop
    
  2. Make a copy not of the keys you are iterating over, but of the keys you are going to delete. In other words, don't delete these keys while iterating; instead, add them to a list, then delete the keys in that list once you have finished iterating. This is slightly more complicated than 1, but much less so than 3. It is also fast. This is what you do in your first example.

    delete_these = []
    for k in dict_:
      if condition(k, dict_[k]):
        delete_these.append(k)
        continue
      # do other things you need to do in this loop
    
    for k in delete_these:
        del dict_[k]
    
  3. The only way to avoid making some sort of new list is, as you suggest, to make a special dictionary. But that requires that deleting a key does not actually delete it, but only marks it as deleted, with the real deletion happening only once you call a purge method. This requires quite a lot of implementation, there are edge cases, and you'll trip yourself up by forgetting to purge, etc. Iterating over the dictionary must also still include the deleted keys, which will bite you at some point. So I wouldn't recommend it. Also, however you implement this in Python, you are likely to once again end up with a list of things to delete, so it's likely to just be a complicated and error-prone version of 2. If you implemented it in C, you could probably avoid the copying by adding the flags directly into the hash-key structure. But as mentioned, the problems really overshadow the benefits.
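A minimal sketch of what option 3 might look like (hypothetical, just to make the drawbacks concrete): deletion only overwrites the value with a sentinel, which never resizes the dict, so iteration stays safe; but membership tests lie until purge() runs, and purge() itself builds exactly the kind of key list option 2 uses.

```python
_TOMBSTONE = object()   # sentinel standing in for "deleted"

class tombstone_dict(dict):
    """Hypothetical option-3 sketch: deletes only mark keys;
    purge() removes them for real."""

    def __delitem__(self, key):
        if dict.__getitem__(self, key) is _TOMBSTONE:
            raise KeyError(key)
        dict.__setitem__(self, key, _TOMBSTONE)   # mark, don't resize

    def purge(self):
        # note: this builds a key list anyway, just like option 2
        dead = [k for k, v in dict.items(self) if v is _TOMBSTONE]
        for k in dead:
            dict.__delitem__(self, k)


d = tombstone_dict(a=1, b=2, c=3)
for k, v in d.items():       # safe: marking never changes the size
    if v == 1:
        del d[k]
print('a' in d)              # -> True: the pitfall; visible until purge
d.purge()
print(dict(d))               # -> {'b': 2, 'c': 3}
```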

Lennart Regebro
  • Yes.. It turns out that copying keys from `.keys()` into a list is very cheap compared to checking the condition. I am struggling to see if it can ever result in more than ~20% overhead, even in the worst case. And a custom dictionary with no overhead (and bugs) is hard to imagine. – max Jan 26 '12 at 19:50
4

You can accomplish this by iterating over a static list of the key/value pairs of the dictionary, instead of iterating over a dictionary view.

Basically, iterating over list(dict_.items()) instead of dict_.items() will work:

for k, v in list(dict_.items()):
  if condition(k, v):
    del dict_[k]
    continue
  # do other things you need to do in this loop

Here is an example (ideone):

dict_ = {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g'}
for k, v in list(dict_.items()):
    if k % 2 == 0:
        print("Deleting  ", (k, v))
        del dict_[k]
        continue
    print("Processing", (k, v))

and the output:

Deleting   (0, 'a')
Processing (1, 'b')
Deleting   (2, 'c')
Processing (3, 'd')
Deleting   (4, 'e')
Processing (5, 'f')
Deleting   (6, 'g')
Andrew Clark
  • 2
    But again, this requires a copy. –  Jan 26 '12 at 18:45
  • That is true, but I expect that any method of iterating that allows deletion will require iterating over something static, which implies a copy. Maybe someone can prove me wrong with some clever implementation of a custom dictionary. – Andrew Clark Jan 26 '12 at 18:57
  • @F.J: actually, your approach is a lot faster than the `dict` comprehension. I believe this is because creating the dictionary structure is quite expensive (even though the values are linked, not copied). While the `dict` comprehension was 5 times slower than the `remove` loop in my test, your approach is only 20% slower. Still, I can imagine situations where it would be undesirable. – max Jan 26 '12 at 19:13
  • @F.J : oops as is, it's still quite slow in my test (3 times slower versus the `remove` loop). It was only very fast when I used `keys()` rather than `items()`, and looked up values by `dict_[k]`. – max Jan 26 '12 at 19:40
  • @F.J.: linked lists have the property that you could change them during iteration i.e., they are not static. See [example with `OrderedDict` in my answer](http://stackoverflow.com/a/9024146/4279) – jfs Jan 26 '12 at 19:43
3

Naive implementation for Python 2.x and 3.x:

import sys
from collections import deque


def _protect_from_delete(func):
    def wrapper(self, *args, **kwargs):
        try:
            self._iterating += 1
            for item in func(self, *args, **kwargs):
                yield item
        finally:
            self._iterating -= 1
            self._delete_pending()
    return wrapper

class DeletableDict(dict):
    def __init__(self, *args, **kwargs):
        super(DeletableDict, self).__init__(*args, **kwargs)
        self._keys_to_delete = deque()
        self._iterating = 0

    if sys.version_info[0] != 3:
        iterkeys = _protect_from_delete(dict.iterkeys)
        itervalues = _protect_from_delete(dict.itervalues)
        iteritems = _protect_from_delete(dict.iteritems)
    else:
        keys = _protect_from_delete(dict.keys)
        values = _protect_from_delete(dict.values)
        items = _protect_from_delete(dict.items)  
    __iter__ = _protect_from_delete(dict.__iter__)

    def __delitem__(self, key):
        if not self._iterating:
            return super(DeletableDict, self).__delitem__(key)
        self._keys_to_delete.append(key)

    def _delete_pending(self):
        for key in self._keys_to_delete:
            super(DeletableDict, self).__delitem__(key)
        self._keys_to_delete.clear()

if __name__ == '__main__':
    dct = DeletableDict((i, i*2) for i in range(15))
    if sys.version_info[0] != 3:
        for k, v in dct.iteritems():
            if k < 5:
                del dct[k]
        print(dct)
        for k in dct.iterkeys():
            if k > 8:
                del dct[k]
        print(dct)
        for k in dct:
            if k < 8:
                del dct[k]
        print(dct)
    else:
        for k, v in dct.items():
            if k < 5:
                del dct[k]
        print(dct)

When iterating over keys, items, or values, it sets the flag self._iterating. In __delitem__ it checks whether the item can be deleted immediately; if not, it stores the key in a temporary queue. At the end of the iteration it deletes all pending keys.

It's a very naive implementation, and I wouldn't recommend using it in production code.

EDIT

Added support for Python 3 and improvements from @jsbueno comments.

Python 3 run on Ideone.com

reclosedev
  • 1
    Nice implementation - but the OP asked explicitly for a Python 3 version - one should only need to change the wrapped methods in `__init__` to get this to work in Python 3. That apart, I don't know if this works at all, as it replaces the "dunder" (magic "`__xxx__`") methods on the instance - these are usually ignored - normally one has to replace them on the class itself, not on the instance. – jsbueno Jan 27 '12 at 02:15
  • btw, this would be my approach - if this code is made to work on Python 3, it should be the "correct" answer to this question. – jsbueno Jan 27 '12 at 02:16
  • Thanks. I will play with this to see if I can speed it up; right now, the copy of the keys is still the fastest approach in my specific situation. – max Feb 06 '12 at 09:12
3

Python 3.2 has such dict in the stdlib:

#!/usr/bin/env python3
from collections import OrderedDict as odict

d = odict(zip(range(3), "abc"))
print(d)
for k in d:
    if k == 2:
       del d[k]
print(d)

Output

OrderedDict([(0, 'a'), (1, 'b'), (2, 'c')])
OrderedDict([(0, 'a'), (1, 'b')])

Iteration is performed over a linked list; see the __iter__() method implementation. The deletion is safe (in Python 3.2) even though the items are weak references.

jfs
  • Wow! I use them all the time, never knew they are del-safe. Is it guaranteed though or just implementation-dependent? The documentation doesn't seem to promise that. – max Jan 26 '12 at 19:58
  • @max: `linked list` is an implementation detail. Special care needed to allow deletion during iteration. I might be mistaken about how it works. – jfs Jan 26 '12 at 20:28
  • 2
    It uses 5 times as much memory and takes twice as long to delete keys though... ;-) http://pastebin.com/FK9F4G4m – Lennart Regebro Jan 26 '12 at 20:28
0

A slightly different approach; sometimes deletion is overrated. While iterating, you can overwrite the value in the dictionary, assigning it None. This does not change the overall structure; it just re-points one element to None. This can be done safely while iterating. If you really must, you can delete the Nones afterwards (assuming you never previously stored Nones), or just have your code tolerate retrieving None as if the key weren't there in the first place.
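A sketch of that idea, with condition() as a stand-in predicate: overwriting an existing value never resizes the dict, so it is safe mid-iteration; the optional cleanup pass afterwards iterates a list copy of the keys.

```python
def condition(k, v):
    return v % 2 == 0      # hypothetical predicate

dict_ = {k: k for k in range(6)}

for k, v in dict_.items():
    if v is not None and condition(k, v):
        dict_[k] = None    # overwrite in place: no resize, safe mid-iteration

# optional cleanup pass (safe: iterates a list copy of the keys)
for k in list(dict_):
    if dict_[k] is None:
        del dict_[k]

print(dict_)               # -> {1: 1, 3: 3, 5: 5}
```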

The whole conversation here really revolves around the size of the dictionary and the expected proportion of elements you want to delete. Pin those down, and one of the solutions presented in these answers will be the "right one" for your specific use case.

dsz
0
  1. You can make a copy of the list of keys (you don't need to copy the values) at the beginning of the iteration, and iterate over those (checking that each key is still there). This is inefficient if there are a lot of keys.
  2. You can embed your first example's code inside a class. __iter__, __delitem__, and other special methods need to collaborate to keep a list of items to be removed while an iteration happens. When there are no current iterations, __delitem__ can just delete an item, but when at least one iteration is happening, it should instead add the key to a list of keys to be deleted. When the last active iteration finishes, it should actually delete things. This is somewhat inefficient if there are a lot of keys to remove, and will, of course, blow up if there's always at least one iteration going on.
  • About your case 2: a. It's really just a variation of his first example, with the keys to delete in a separate list. b. What happens if you don't exhaust the iteration? Then the list won't get purged... – Lennart Regebro Jan 26 '12 at 19:13
0

This could work as a compromise between the two examples - two lines longer than the second one, but shorter and slightly faster than the first. Python 2:

dict_ = {k : random.randint(0, 40000) for k in range(0,200000)}

dict_remove = [k for k,v in dict_.iteritems() if v < 3000]
for k in dict_remove:
    del dict_[k]

Split into a function and it's down to one line each call (whether this is more readable or not is your call):

def dict_remove(dict_, keys):
    for k in keys:
        del dict_[k]

dict_remove(dict_, [k for k,v in dict_.iteritems() if v < 3000])

Regardless of where the code is stored, you'll have to store the keys needing deletion somewhere. The only way around that is using generator expressions, which will explode the moment you delete a key for the first time.
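To make "explode" concrete (shown in Python 3 syntax, unlike the answer's Python 2 code): deleting a key while a live view iterator is outstanding raises RuntimeError on the very next step.

```python
dict_ = {1: 100, 2: 5000, 3: 200}
try:
    for k, v in dict_.items():   # live view, no list() copy
        if v < 3000:
            del dict_[k]         # size changes under the iterator...
except RuntimeError as exc:
    print(exc)                   # -> dictionary changed size during iteration
```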

ThreeHams