Python behavior in replacing dictionary keys under iteration

Question

I'm looking at a piece of Python 3.6 (note that Python 3.8 throws an error instead) code below:

x = {0: None}
for i in x:
    del x[i]
    x[i+1] = None
    print(i)

This is deleting key i and adding key i + 1 to the dictionary at every iteration, so presumably it should loop forever, printing incrementing values at each iteration? However, it in fact stops iterating after printing i = 4. I was wondering why this is the behavior and what's responsible for this?

Related: https://stackoverflow.com/questions/37688512/does-iterating-over-my-dict-keys-and-modifying-the-values-in-the-dictionary-in — jarmod, Oct 18 '20 at 23:35
Related: [How to remove items from a list while iterating?](https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating). — wwii, Oct 18 '20 at 23:42
Modifying a container while iterating through it leads to undefined behaviour. Because of that, newer versions cause an exception if you even try to do it. Why does it beave the way it does? Because of the way it is implemented. It could have been implemented differently. Neither version makes more or less sense. — zvone, Oct 18 '20 at 23:44

Błażej Michalik · Accepted Answer · 2021-02-14T20:41:31.627

TL;DR: It's because the dict is being resized when the size of the key table exceeds PyDict_MINSIZE, which makes the interpreter realize that the iterator has moved past the point it should've stopped on.

Read on to understand how you could've found this out on your own.

It's not possible to answer such question completely, so I'll try to explain what I've found, and, at the same time, try to equip you with the tools needed to explore on your own.

While it does come down to implementation-specific undefined behavior, it isn't hard to find out what is going on, if you know how to navigate CPython code. Here is the list of suspects based on your code:

The C structure that holds the dictionary iterator is dictiterobject
You go to the next key in the iterator using dictiter_iternextkey()
The C structure that holds the dictionary is PyDictObject.
You create a new dict using PyDict_New(), which is actually a wrapper around new_dict()
You del x[i] using PyDict_DelItem(), a wrapper around _PyDict_DelItem_KnownHash()
Adding a key is done using PyDict_SetItem(), which is a wrapper around insertdict().

First, notice that the dictionary iterator only stops when goto fail is used in dictiter_iternextkey(). This can only happen if the iterator position goes past the number of entries in the dictionary keys table (di->di_pos >= di->di_dict->ma_keys->dk_nentries, written as i >= n in the code).

Let's use GDB to see what's actually going on. First, compile CPython 3.6.10 (see the devguide for complete instructions). Run CPython under GDB, set a breakpoint in dictiter_iternextkey(), run your script, and print the di_pos and dk_nentries with every iteration:

git clone https://github.com/python/cpython
cd cpython
git checkout v3.6.10
./configure --with-pydebug
make -j 16 -s

# Put your code into weird.py

gdb ./python

(gdb) b Objects/dictobject.c:3480
(gdb) run weird.py

# Iterate these commands until process exits
(gdb) p di->di_pos
(gdb) p di->di_dict->ma_keys->dk_nentries
(gdb) c

What you'll see, is that on every iteration of your loop, di_pos and dk_nentries get incremented by one, except on the last one, where dk_nentries is being reset to 1.

Now, we'll just need to find out what is resetting the dk_nentries counter. There are two other lines in your code that could do that: del x[i] and x[i+1] = None. You could find out which one is it by reading the code, but let's use a watchpoint for that instead:

(gdb) b Objects/dictobject.c:3480
(gdb) run weird.py
(gdb) watch -l di->di_dict->ma_keys->dk_nentries
# 'c'-ontinue until the following output appears:

(gdb) c                                                                                       
Continuing.                                                                                   
                                                                                              
Hardware watchpoint 3: -location di->di_dict->ma_keys->dk_nentries                            
                                                                                              
Old value = 5                                                                                 
New value = -2604246222170760229                                                              
__memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:204
204     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory.

We're now in the memory management code. The new value looks as if the old key table was freed - it's now garbage. Let's look at the backtrace, to see what code issued the free()-ing:

(gdb) bt

...
#5  0x00005555556206e6 in dictresize (mp=0x7ffff72ffaa8, minsize=<optimized out>) at Objects/dictobject.c:1314
#6  0x0000555555620751 in insertion_resize (mp=<optimized out>) at Objects/dictobject.c:1103
#7  0x0000555555620e6d in insertdict (mp=0x7ffff72ffaa8, key=5, hash=5, value=None)
#8  0x0000555555623e4a in PyDict_SetItem (op={}, key=5, value=None) at Objects/dictobject.c:1576
...

It happens when you add the key. The interpreter figures out how many entries are actually there when resizing the dictionary, and refreshes the table, including its counter. But why didn't it happen earlier?

If you look at the code that called insertion_resize(), you'll see the following branch:

        if (mp->ma_keys->dk_usable <= 0) {
            /* Need to resize. */
            if (insertion_resize(mp) < 0)
                goto Fail;
            find_empty_slot(mp, key, hash, &value_addr, &hashpos);
        }

As you can see, the PyDictKeysObject struct has a dk_usable field. As an optimization, the key table is initialized with a bit more space, so that when adding 2-3 keys, the interpreter doesn't have to resize the dictionary right away.

The amount of "free space" that is there at the beginning is controlled by PyDict_MINSIZE in PyDict_New(). This is actually mentioned in the macro section of the file. Finding out why setting it to 8 lets a dict have at most 5 entries is left as an exercise.

Check for yourself: if you recompile CPython with PyDict_MINSIZE set to 32 (it must be a power of 2), it will make your code iterate up to 20.

Python behavior in replacing dictionary keys under iteration

1 Answers1