In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
- 2.0n, 1.0n, 1.5n: Copy, delete half the original
- 2.0n, 1.0n, 1.5n: Copy, delete half the copy
- 1.5n, 1.0n, 1.5n: Build a deletion set, then delete
- 1.0n, 1.0n, 0.5n: Delete half in-place
- 1.5n, 0.5n, 1.0n: Delete half in-place, then compact
- 1.5n, 0.5n, 0.5n: Copy half
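For concreteness, here is a minimal sketch of the last two options; the predicate `keep` and the key list `keys_to_delete` are hypothetical stand-ins for however you decide which half survives:

```python
def delete_half_in_place(d, keys_to_delete):
    # "Delete half in-place": peak ~1.0n, final ~1.0n. The hash table
    # never shrinks on deletion, so you keep the full-sized table at
    # half the load.
    for k in keys_to_delete:
        del d[k]

def copy_half(d, keep):
    # "Copy half": peak ~1.5n, final ~0.5n. Build a new, half-sized
    # dict and let the old table become garbage.
    return {k: v for k, v in d.items() if keep(k)}
```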
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
- Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
- Fewer places to make "small" mistakes that are hard to detect and debug.
- Generally more compact and more readable.
- Semantically restricted looping, dict building, and exception handling provide opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
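If you want to check that last point on your own build (the exact percentage varies by version and workload), a quick `timeit` sketch along these lines will do it; `src` and the `k % 2` filter are just placeholders:

```python
import timeit

setup = "src = {i: i for i in range(100000)}"

loop = """
d = {}
for k, v in src.items():
    if k % 2:
        d[k] = v
"""

comp = "d = {k: v for k, v in src.items() if k % 2}"

# Take the best of several runs; absolute numbers depend on the machine
# and the CPython version, but the comprehension should come out ahead.
print("explicit loop:", min(timeit.repeat(loop, setup, number=100)))
print("comprehension:", min(timeit.repeat(comp, setup, number=100)))
```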
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the assumptions you made should obviously go the other way. For example, Python only stores references in containers, not actual values, and it avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
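To make "open addressing" concrete, here is a toy lookup loosely modeled on the probe sequence described in the comments of CPython's `Objects/dictobject.c` (the exact recurrence and constants have shifted between versions); the point is that every slot is just a reference, with no per-entry chain nodes to allocate:

```python
PERTURB_SHIFT = 5

def toy_lookup(table, key):
    """Look up key in a power-of-two-sized list of (key, value) pairs or None."""
    mask = len(table) - 1
    h = hash(key) & 0xFFFFFFFFFFFFFFFF   # emulate an unsigned hash
    perturb = h
    i = h & mask
    while table[i] is not None:          # an empty slot means "not found"
        stored_key, stored_value = table[i]
        if stored_key == key:
            return stored_value
        # Mix more hash bits into the probe sequence, then fall back to a
        # full-period linear-congruential walk once perturb reaches zero.
        perturb >>= PERTURB_SHIFT
        i = (5 * i + 1 + perturb) & mask
    raise KeyError(key)
```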
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: you're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes.

Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
> Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. Your algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
> As your arguments in favor of not providing to the programmer the option of delete in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being one obvious way to do things, and of having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
> Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing `it = iter(mydict.items())` and then `for (a, b) in it:`. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped `StopIteration` handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
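If you want to see what I mean, disassembling both forms makes the difference visible; opcode names and layout vary by CPython version, and `d` and `pred` here are just placeholder names:

```python
import dis

# The comprehension body is a tight FOR_ITER loop that stores each pair
# with the special-purpose MAP_ADD opcode.
dis.dis("{k: v for k, v in d.items() if pred(k)}")

# The explicit loop goes through the general STORE_SUBSCR path, with the
# usual block and jump machinery around it.
dis.dis(
    "result = {}\n"
    "for k, v in d.items():\n"
    "    if pred(k):\n"
    "        result[k] = v\n"
)
```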
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
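A quick way to see this on CPython (exact byte counts vary by version and build):

```python
import sys

n = 100000
a = {i: None for i in range(n)}
b = {i: None for i in range(n)}

for k in list(a):            # delete half of a in place
    if k % 2 == 0:
        del a[k]

b = {k: v for k, v in b.items() if k % 2}   # copy half of b

# a still owns its full-sized table (same load-factor argument as above);
# b got a table sized for n/2 entries, roughly half as big.
print(sys.getsizeof(a), sys.getsizeof(b))
```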
> Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
> and none of the options can take less than 1.0n time
Sorry, this was probably unclear: I was talking about added cost here, probably shouldn't have been, and didn't say so. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
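If that still sounds wrong, it's easy to measure; this sketch times the two loops directly (the key choice and sizes are arbitrary):

```python
import time

n = 1000000
src = {i: None for i in range(n)}
half = list(range(0, n, 2))

d = dict(src)
t0 = time.perf_counter()
for k in half:                 # delete 0.5n keys from a full dict
    del d[k]
t1 = time.perf_counter()

d2 = {}
t2 = time.perf_counter()
for k in half:                 # add 0.5n keys to another dict
    d2[k] = None
t3 = time.perf_counter()

# Both loops hash and probe 0.5n keys and store or unlink one reference
# each; measure it yourself rather than arguing about the constant.
print("delete 0.5n keys:", t1 - t0)
print("add    0.5n keys:", t3 - t2)
```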
> The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, `d = {k: v for k, v in d.items() if pred(k)}` is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
> But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.
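If you want to watch that window, `tracemalloc` shows the peak-versus-final gap directly; `keep` is again a hypothetical predicate, and the raw numbers include the key objects too, so don't expect an exact 1.5n/0.5n ratio:

```python
import tracemalloc

def keep(k):                       # hypothetical predicate: keep the odd keys
    return k % 2

tracemalloc.start()

n = 100000
d = {i: None for i in range(n)}    # the original, full-sized dict
baseline, _ = tracemalloc.get_traced_memory()

d = {k: v for k, v in d.items() if keep(k)}   # old table becomes garbage here

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# peak covers the moment when both tables were alive; current is what is
# left once the old table has been cleaned up.
print("original:", baseline, "peak:", peak, "after copy-half:", current)
```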