1

I'm diggin in the dict structure in Python, and I'm trying to understand the implementation of compact dict with faster iteration explained here [Python-Dev] More compact dictionaries with faster iteration by Raymond Hettinger

In this message, Raymond shows how the current dict implementation is and how it can be more memory efficient. He depict dict structure like this:

d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}

is currently stored as:

entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]

Instead, the data should be organized as follows:

indices =  [None, 1, None, None, None, 0, None, 2]
entries =  [[-9092791511155847987, 'timmy', 'red'],
            [-8522787127447073495, 'barry', 'green'],
            [-6480567542315338377, 'guido', 'blue']]

My question is how does the new dict implementation performs lookup, if the indices data is numerical 0, 1, 2, as the items were entered? Is it just for clarity, and the actual value is different (eg, hash value of the key) ?

Some references I already looked Dictionaries are ordered in Python 3.6+

Chen A.
  • 10,140
  • 3
  • 42
  • 61
  • 1
    There's a link to a recipe that was included on that webpage, you should consider taking a look at it: http://code.activestate.com/recipes/578375/ – cs95 Sep 03 '17 at 18:37
  • I actually glanced over it, thanks for the suggestion I'll look into it deeper – Chen A. Sep 03 '17 at 19:09

2 Answers2

3

After doing more research, I've found the answer I look for.

Python's regular dict allocates 24 byte indices in the array (PyDictEntry). An empty dict in Python 2.7 consumes

d = dict()
import sys
sys.getsizeof(d)
272

(8 * 24 = 192 + overhead). And that's the dict entry object from the source code:

typedef struct {
    /* Cached hash code of me_key.  Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;  --> 8 bytes
    PyObject *me_key;  --> 8 bytes
    PyObject *me_value;  --> 8 bytes
} PyDictEntry;

With the new compact dict, the table is split to two: indexes and entries. The indexes array is of int type (for empty to small dicts) which is 8 bytes. These are references to the actual index of the entry, if there is any. If it's empty it returns None (or dummy in case there were deletions). Then the entries list contains only allocated objects.

Using my example from the question, the dict with 3 entries in Python 3.6 would consume: indices (8 bytes) + entries (3 * 24 = 72) = 80 bytes + overhead.

That's pretty neat saving for the same data with little to no impact on performance.

When lookup is performed it looks in the indexes table. Then it uses the returned value to read / append an entry to the entries list.

Chen A.
  • 10,140
  • 3
  • 42
  • 61
2

I know this question is old, but there's a talk in which Raymond Hettinger explains the details and inception of the compact dict and also briefly mentions the key sharing feature of instance dicts. https://youtu.be/p33CVV29OG8

Here you can read more about the key sharing dicts. https://www.python.org/dev/peps/pep-0412/