0

I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.

So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.

Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.

Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.

Antimony
  • 37,781
  • 10
  • 100
  • 107
  • First, what version of Python are you using? Looking at CPython 2.7, `collections.OrderedDict` is in pure Python, and it doesn't have a custom __del__ method at all. – abarnert Oct 01 '12 at 03:18
  • 3
    Also, what exactly are you trying to accomplish here? The standard `dict` actually _is_ deterministic—run the same program over and over, and the order will always be the same. What it isn't is _predictable_ (without a whole lot of inside knowledge, or just running it once to see what you get), but why do you need the order to be predictable? – abarnert Oct 01 '12 at 03:19
  • 1
    And even if this is what you do need, you don't need to roll your own; the docs for `OrderedDict` (in both 2.7 and 3.3) have a link to an "Equivalent OrderedDict recipe that runs on Python 2.4 or later", and that has no __del__; is there a reason you can't use that? – abarnert Oct 01 '12 at 03:22
  • 1
    Finally, Python can break cycles on objects with custom `__del__` methods, it just needs help—you have to look in `gc.garbage` and break the cycle for it (and then remove it from `gc.garbage`, of course). This is generally something you don't want to do, but if you're writing cycles and depending on the gc it's something you're going to have to learn at some point. (Or, of course, you can stop writing cycles, e.g., by using `weakref`s.) – abarnert Oct 01 '12 at 03:26
  • @abarnert I'm running CPython 2.7, and my version does have a custom `__del__`. Also, doing manual garbage collection or rewriting everything to avoid cycles is a huge amount of effort. At that point it's easier to just monkey patch OrderedDict! – Antimony Oct 01 '12 at 03:36
  • Looking at the source, it's not there in the first version of `collections.py` with an `OrderedDict` in the old svn repo, or anywhere in the current Hg repo. What versions had the problem? Meanwhile, are you sure that just removing `__del__` does the right thing? I think it would be a lot safer to either borrow the 2.7.3 implementation, or take the "equivalent recipe" linked from the docs, and use that in place of the stdlib one. – abarnert Oct 01 '12 at 04:53
  • I looked it up, and it was removed in September 2010. Go to the Hg repository and look through the history of `collections.py`. There was even an official bug report and everything. – Antimony Oct 01 '12 at 04:58

3 Answers3

2

If the presence of the __del__ method is problematic for you, just remove it:

>>> import collections
>>> del collections.OrderedDict.__del__

You will gain the ability to use OrderedDicts in a reference cycle. You will lose having the OrderedDict free all its resources immediately upon deletion.

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
  • That's what I ended up doing. At first I was hesitant to resort to Monkey Patching due to all the potential evils it causes, but when I discovered that it was a bug already fixed in later versions, I figured that monkey patching it is completely justified. – Antimony Oct 01 '12 at 13:21
  • Also, the thing about its resources being freed immediately is a red herring. If it's being freed as part of a cycle, that means any unreachable object it refers to is about to get freed as well anyway. – Antimony Oct 01 '12 at 13:29
  • 1
    @Antimony On the other hand, if the OD is not part of a cycle, the \_\_del\_\_ does make a difference between being freed immediately and having to wait for GC to clean-up the underlying doubly-linked list. – Raymond Hettinger Oct 01 '12 at 16:33
  • Oh that it explains it. I was wondering what the rationale for even adding that method in the first place was. – Antimony Oct 01 '12 at 16:44
1

It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.

I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…

But two more points:

rewriting everything to avoid cycles is a huge amount of effort.

The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.

You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.

But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.

Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key, and whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.

For example, take this simple program:

d={}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
  print(k, d[k], id(k), id(d[k]), hash(k))

If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.

The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)

But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • GC isn't a problem for determinism because it has no side effects if you don't have finalizers (does anyone actually use these nowadays?). Could you explain what you mean by dicts being deterministic? I was under the impression that iteration order is determined by the memory addresses of objects, which differ unpredictably from run to run. – Antimony Oct 01 '12 at 13:27
  • I've edited the answer above, but briefly: Even if you only count on finalizers for implicit releases, if those releases as nondeterministic your peak memory use becomes nondeterministic. – abarnert Oct 02 '12 at 18:00
  • "Most objects" do use id for hash unless they happen to be defined otherwise. – Antimony Oct 02 '12 at 22:28
  • But most objects _are_ defined otherwise: `int`, `str`, `bytes`, `unicode`, `tuple`, `frozenset`, most immutable classes in the standard library, etc. hash by value; most mutable types aren't hashable. In fact, all types must be one or the other, unless you also compare them by identity instead of value (as with `type` or `module`). At any rate, _all_ objects order by hash bucket; in some cases that happens to involve `id`, but that doesn't mean all objects order by `id`. – abarnert Oct 03 '12 at 18:56
  • I should also mention: Python won't _stop_ you from using `id` hashing for value types that you define, but you're breaking the invariant that `a == b` implies `hash(a) == hash(b)`, and therefore dictionaries won't work properly. For example: `d={foo(3): 3}; d[foo(3)]` will probably raise a `KeyError` if `foo` is `id`-hashed. – abarnert Oct 03 '12 at 18:58
0

Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.

Day
  • 9,465
  • 6
  • 57
  • 93