In your comparison, you made two big mistakes. First, you neglected to even consider the idiomatic "don't delete anything, copy half the dict" option. Second, you didn't realize that deleting half the entries in a hash table at 2/3 load leaves you with a hash table of the exact same size at 1/3 load.
So, let's compare the actual choices (I'll ignore the 2/3 load to be consistent with your n/2 measures). For each one, there's the peak space, the final space, and the time:
- 2.0n, 1.0n, 1.5n: Copy, delete half the original
- 2.0n, 1.0n, 1.5n: Copy, delete half the copy
- 1.5n, 1.0n, 1.5n: Build a deletion set, then delete
- 1.0n, 1.0n, 0.5n: Delete half in-place
- 1.5n, 0.5n, 1.0n: Delete half in-place, then compact
- 1.5n, 0.5n, 0.5n: Copy half
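For concreteness, here is a minimal sketch of the last two options; the predicate `keep` and the key list `keys_to_delete` are hypothetical stand-ins for however you decide which half survives:

```python
def delete_half_in_place(d, keys_to_delete):
    # "Delete half in-place": peak ~1.0n, final ~1.0n. The hash table
    # never shrinks on deletion, so you keep the full-sized table at
    # half the load.
    for k in keys_to_delete:
        del d[k]

def copy_half(d, keep):
    # "Copy half": peak ~1.5n, final ~0.5n. Build a new, half-sized
    # dict and let the old table become garbage.
    return {k: v for k, v in d.items() if keep(k)}
```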
So, your proposed design would be worse than what we already do idiomatically. Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space.
And meanwhile, building a new dictionary, especially if you use a comprehension, means:
- Effectively non-mutating (automatic thread/process safety, referential transparency, etc.).
- Fewer places to make "small" mistakes that are hard to detect and debug.
- Generally more compact and more readable.
- Semantically restricted looping, dict building, and exception handling provide opportunities for optimization (which CPython takes; typically a comprehension is about 40% faster than an explicit loop).
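If you want to check that last point on your own build (the exact percentage varies by version and workload), a quick `timeit` sketch along these lines will do it; `src` and the `k % 2` filter are just placeholders:

```python
import timeit

setup = "src = {i: i for i in range(100000)}"

loop = """
d = {}
for k, v in src.items():
    if k % 2:
        d[k] = v
"""

comp = "d = {k: v for k, v in src.items() if k % 2}"

# Take the best of several runs; absolute numbers depend on the machine
# and the CPython version, but the comprehension should come out ahead.
print("explicit loop:", min(timeit.repeat(loop, setup, number=100)))
print("comprehension:", min(timeit.repeat(comp, setup, number=100)))
```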
For more information on how dictionaries are implemented in CPython, look at the source, which is comprehensively documented, and mostly pretty readable even if you're not a C expert.
If you think about how things work, some of the assumptions you made should obviously go the other way. For example, Python only stores references in containers, not actual values, and it avoids malloc overhead wherever possible, so what are the odds that it would use chaining instead of open addressing?
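To make "open addressing" concrete, here is a toy lookup loosely modeled on the probe sequence described in the comments of CPython's `Objects/dictobject.c` (the exact recurrence and constants have shifted between versions); the point is that every slot is just a reference, with no per-entry chain nodes to allocate:

```python
PERTURB_SHIFT = 5

def toy_lookup(table, key):
    """Look up key in a power-of-two-sized list of (key, value) pairs or None."""
    mask = len(table) - 1
    h = hash(key) & 0xFFFFFFFFFFFFFFFF   # emulate an unsigned hash
    perturb = h
    i = h & mask
    while table[i] is not None:          # an empty slot means "not found"
        stored_key, stored_value = table[i]
        if stored_key == key:
            return stored_value
        # Mix more hash bits into the probe sequence, then fall back to a
        # full-period linear-congruential walk once perturb reaches zero.
        perturb >>= PERTURB_SHIFT
        i = (5 * i + 1 + perturb) & mask
    raise KeyError(key)
```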
You may also want to look at the PyPy implementation, which is in Python and has more clever tricks.
Before I respond to all of your comments, you should keep in mind that StackOverflow is not where Python changes get considered or made. If you really think something should be changed, you should post it on python-ideas, python-dev, and/or the bugs site. But before you do: you're pretty clearly still using 2.x; if you're not willing to learn 3.x to get any of the improvements or optimizations made over the past half-decade, nobody over there is going to take you seriously when you suggest additional changes.

Also, familiarize yourself with the constructs you want to change; as soon as you start arguing on the basis of Python dicts probably using chaining, the only replies you're going to get will be corrections. Anyway:
> Please explain to me how 'Delete half in place' takes 1.0n space and adds 1.0n space to the final space.
I can't explain something I didn't say and that isn't true. There's no "adds" anywhere. My numbers are total peak space and total final space. Your algorithm is clearly 1.0n for each. Which sounds great, until you compare it to the last two options, which have 0.5n total final space.
> As your arguments in favor of not providing to the programmer the option of delete in place,
The argument not to make a change is never "that change is impossible", and rarely "that change is inherently bad", but usually "the costs of that change outweigh the benefits". The costs are obvious: there's the work involved; the added complexity of the language and each implementation; more differences between Python versions; potential TOOWTDI violations or attractive nuisances; etc. None of those things mean no change can go in; almost every change ever made to Python had almost all of those costs. But if the benefits of a change aren't worth the cost, it's not worth changing. And if the benefits are less than they initially appear because your hoped-for optimization (a) is actually a pessimization, and (b) would require giving up other benefits to use even if it weren't, that puts you a lot farther from the bar.
Also, I'm not sure, but it sounds like you believe that the idea of there being one obvious way to do things, and of having a language designed to encourage that obvious way when possible, constitutes Python being a "nanny". If so, then you're seriously using the wrong language. There are people who hate Python for trying to get them to do things the Pythonic way, but those people are smart enough not to use Python, much less try to change it.
> Your fourth point, which echoes the one presented in the mailing list about the issue, could easily be fixed … by simply providing a 'for (a,b) in mydict.iteritems() as iter', in the same way as it is currently done for file handles in a 'with open(...) as filehandle' context.
How would that "fix" anything? It sounds like the exact same semantics you could get by writing `it = iter(mydict.items())` and then `for (a, b) in it:`. But whatever the semantics are, how would they provide the same, or equivalent, easy opportunities for compiler optimization that comprehensions provide? In a comprehension, there is only one place in the scope that you can return from. It always returns the top value already on the stack. There is guaranteed to be no exception handling in the current scope except a stereotyped `StopIteration` handler. There is a very specific sequence of events in building the list/set/dict that makes it safe to use generally-unsafe and inflexible opcodes that short-circuit the usual behavior. How are you expecting to get any of those optimizations, much less all of them?
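If you want to see what I mean, disassembling both forms makes the difference visible; opcode names and layout vary by CPython version, and `d` and `pred` here are just placeholder names:

```python
import dis

# The comprehension body is a tight FOR_ITER loop that stores each pair
# with the special-purpose MAP_ADD opcode.
dis.dis("{k: v for k, v in d.items() if pred(k)}")

# The explicit loop goes through the general STORE_SUBSCR path, with the
# usual block and jump machinery around it.
dis.dis(
    "result = {}\n"
    "for k, v in d.items():\n"
    "    if pred(k):\n"
    "        result[k] = v\n"
)
```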
"Either you're doubling the final (permanent) space just to save an equivalent amount of transient space, or you're taking twice as long for the same space." Please explain how you think this works.
This works because 1.0 is double 0.5. More concretely, a hash table that's expanded to n elements and is now at about 1/3 load is twice as big as a hash table that's expanded to n/2 elements and is now at about 2/3 load. How is this not clear?
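A quick way to see this on CPython (exact byte counts vary by version and build):

```python
import sys

n = 100000
a = {i: None for i in range(n)}
b = {i: None for i in range(n)}

for k in list(a):            # delete half of a in place
    if k % 2 == 0:
        del a[k]

b = {k: v for k, v in b.items() if k % 2}   # copy half of b

# a still owns its full-sized table (same load-factor argument as above);
# b got a table sized for n/2 entries, roughly half as big.
print(sys.getsizeof(a), sys.getsizeof(b))
```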
> Delete in place takes O(1) space
OK, if you want to count extra final space instead of total final space, then yes, we can say that delete in place takes 0.0n space, and copying half takes -0.5n. Shifting the zero point doesn't change the comparison.
> and none of the options can take less than 1.0n time
Sorry, this was probably unclear: I was talking about added cost here, probably shouldn't have been, and didn't say so. But again, changing the scale or the zero point doesn't make any difference. It clearly takes just as much time to delete 0.5n keys from one dict as it does to add 0.5n keys to another one, and all of the other steps are identical, so there is no time difference. Whether you call them both 0.5n or both 1.0n, they're still equal.
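If that still sounds wrong, it's easy to measure; this sketch times the two loops directly (the key choice and sizes are arbitrary):

```python
import time

n = 1000000
src = {i: None for i in range(n)}
half = list(range(0, n, 2))

d = dict(src)
t0 = time.perf_counter()
for k in half:                 # delete 0.5n keys from a full dict
    del d[k]
t1 = time.perf_counter()

d2 = {}
t2 = time.perf_counter()
for k in half:                 # add 0.5n keys to another dict
    d2[k] = None
t3 = time.perf_counter()

# Both loops hash and probe 0.5n keys and store or unlink one reference
# each; measure it yourself rather than arguing about the constant.
print("delete 0.5n keys:", t1 - t0)
print("add    0.5n keys:", t3 - t2)
```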
> The reason I didn't consider only copying half the dictionary, is that the requirement is to actually modify the dictionary, as is clearly stated.
No, it isn't clearly stated. All you said is "I need to remove about half the elements in a set/dictionary". In 99% of use cases, `d = {k: v for k, v in d.items() if pred(k)}` is the way to write that. And many of the cases people come up with where that isn't true ("but I need the background thread to see the changes immediately") are actively bad ideas. Of course there are some counterexamples, but you can't expect people to just assume you had one when you didn't even give a hint that you might.
> But also, the final space of that is 1.5n, not .5n
No it isn't. The original hash table is garbage, so it gets cleaned up, so the final space is just the new, half-sized hash table. (If that isn't true, then you actually still need the original dict alongside the new one, in which case you had no choice but to copy in the first place.)
And if you're going to say, "Yeah, but until it gets cleaned up"—yes, that's why the peak space is 1.5n instead of 1.0n, because there is some non-zero time that both hash tables are alive.
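If you want to watch that window, `tracemalloc` shows the peak-versus-final gap directly; `keep` is again a hypothetical predicate, and the raw numbers include the key objects too, so don't expect an exact 1.5n/0.5n ratio:

```python
import tracemalloc

def keep(k):                       # hypothetical predicate: keep the odd keys
    return k % 2

tracemalloc.start()

n = 100000
d = {i: None for i in range(n)}    # the original, full-sized dict
baseline, _ = tracemalloc.get_traced_memory()

d = {k: v for k, v in d.items() if keep(k)}   # old table becomes garbage here

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# peak covers the moment when both tables were alive; current is what is
# left once the old table has been cleaned up.
print("original:", baseline, "peak:", peak, "after copy-half:", current)
```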