1

I find I am accessing Python dicts using static strings very frequently and in nested loops. I'm guessing that Python needs to hash the strings for every access, which could be impacting performance.

For example:

for d in large_list_of_dicts:
    for d2 in d['trees']:
        v = d2['leaves']

Does Cython hash these static string keys to hash values (only once) and reuse these hashes? If so, might this significantly improve performance of such loops?

  • 4
    I'm not familiar with Cython, but even in regular Python a hash will be computed for a given string object at most once - it gets cached in the object. I can't imagine that Cython would have thrown away such a basic optimization. – jasonharper Mar 06 '19 at 14:29
  • @jasonharper Thanks, I didn't realise that. So it sounds like there'd be little to gain in using Cython here. – Incompetent Perfectionist Mar 06 '19 at 14:36

1 Answers1

1

Cython will not perform any magic here: it will just delegate the calls to PyDict_GetItemWithError - basically the same thing the Python interpreter will do (but probably slightly faster).

However, an unicode-object (I assume we are speaking about Python3-strings) caches its hash-value (in PyUnicodeObject.hash-member field), so this needs to be calculated only once - which makes sense because an unicode-object is immutable, that means the hash cannot change.

Here is the CPython code responsible for the hash calculation/caching:

#define _PyUnicode_HASH(op)                             \
(((PyASCIIObject *)(op))->hash)
...

static Py_hash_t
unicode_hash(PyObject *self)
{
    ...
    // if hash already calculated, return cached value
    if (_PyUnicode_HASH(self) != -1)
        return _PyUnicode_HASH(self);
    ...
    // else caclculate hash, cache value, return it
    x = _Py_HashBytes(PyUnicode_DATA(self),
                      PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));
    _PyUnicode_HASH(self) = x;
    return x;
}

So as you can see, there is no need for Cython to avoid the hash-recalculation - this optimization is already done by CPython.

By using Cython here, one could win up to 10-30% because it would eliminate the interpreter for this part of the code (see for example this SO-post) - not really much, but better than nothing.

ead
  • 32,758
  • 6
  • 90
  • 153