
I've been creating huge dicts (millions of entries), and I've noticed that if I create them with the keys in order it is much faster.

I imagine it has something to do with collisions in the hash function, but can someone explain why this is happening and whether it is consistent across versions of Python?

Here is an artificial example:

import timeit
import random

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        otest = [str(i) for i in range(size)]
        utest = list(otest)
        random.shuffle(utest)
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)

def test_ordered():
    dict((k, k) for k in ordered.pop())

def test_unordered():
    dict((k, k) for k in unordered.pop())

print "unordered: ",
print timeit.timeit("test_unordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
                    setup="from __main__ import test_unordered, test_ordered",
                    number=NUM_TESTS)

The output on my machine is consistently:

(X)$ python /tmp/test.py 
unordered:  8.60760807991
ordered:  5.1214389801

I'm using Python 2.7.3 on Ubuntu Precise x86_64.

barracel
    May be related: [Why is processing a sorted array faster than an unsorted array?](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) – Ashwini Chaudhary Aug 14 '13 at 07:05
  • Could be related, but we should have a look at the C implementation of the dict – barracel Aug 18 '13 at 05:41

2 Answers


I'm almost certain this is what's going on: when you first create otest, the strings are being stored in order in memory. When you create utest, the strings point to the same memory buffers, except that now those locations are out of order, which kills cache performance on the unordered test cases.

Here's the evidence. I've replaced your get_test_data function with this version:

def get_test_data(num, size):
    olist, ulist = [], []
    for _ in range(num):
        nums = range(size)
        random.shuffle(nums)
        utest = [str(i) for i in nums]
        otest = list(utest)
        otest.sort(key=lambda x: int(x))
        olist.append(otest)
        ulist.append(utest)
    return olist, ulist

The idea is that I'm now constructing the strings in ulist consecutively in memory, then building olist by sorting those strings with the appropriate key. On my machine, this reverses the running times of the two tests.
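A rough way to check the memory-layout claim directly is sketched below. It assumes CPython, where id() happens to be the object's memory address, and the helper name fraction_forward_in_memory is made up for illustration. It builds the lists the same way the question does and measures how often walking a key list also walks forward in memory:

import random

def fraction_forward_in_memory(keys):
    # In CPython, id() is the object's memory address: count how often the
    # next key in the list also sits later in memory than the current one.
    addrs = map(id, keys)
    forward = sum(1 for a, b in zip(addrs, addrs[1:]) if b > a)
    return float(forward) / (len(addrs) - 1)

otest = [str(i) for i in range(1000000)]   # strings created in order
utest = list(otest)                        # same string objects...
random.shuffle(utest)                      # ...in shuffled order

print "ordered:  ", fraction_forward_in_memory(otest)
print "unordered:", fraction_forward_in_memory(utest)

The ordered list should score close to 1.0 and the shuffled one close to 0.5. It's only a proxy for cache behaviour, but it makes the layout difference visible.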

disatisfieddinosaur
  • The rest of my code is exactly the same as @barracel's above, except that I had to cut the list size by an order of magnitude. My computer doesn't have that much memory :( I got (1.25s, 0.97s) for the original test and (0.93s, 1.09s) for the new one. – disatisfieddinosaur Aug 14 '13 at 07:02
  • You're right; with your function, on my machine I get: "unordered: 7.00250697136 ordered: 7.96612787247". The thing is that in the original code there was only one list, which was read from disk, so I think I should improve the sample code to better reflect that situation. – barracel Aug 14 '13 at 07:06
  • Huh. There's still something weird. The discrepancy the other way isn't as dramatic...ideas? – disatisfieddinosaur Aug 14 '13 at 07:09
  • Ah. Here's a theory for the rest of the difference. The Python string hash is a [rolling hash function](http://en.wikipedia.org/wiki/Rolling_hash) that doesn't change much when strings differ only by their last character. This means that every 10 consecutive numbers hash to buckets that are close together, which has ever-so-slightly better cache performance. (dictionary fits in memory, but there are always smaller + faster levels of cache.) – disatisfieddinosaur Aug 14 '13 at 08:00
  • I created a custom string class and tested the code both with a random __hash__ and with the standard string __hash__. It confirmed your theory: creating the dict with the standard string hash is faster. Checking the implementation of the string __hash__, it really does give better locality to the memory lookups (a rough sketch of such a test is below). Maybe you could add your comment to your answer. – barracel Aug 18 '13 at 06:36
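A rough reconstruction of the experiment described in that last comment. The class names, the multiplicative scrambling constant, and the list size are my own choices, not taken from the post; the scrambled hash stands in for the "random" __hash__ barracel describes:

import timeit

class PlainHashStr(str):
    # Same subclass overhead as below, but keeps the standard string hash.
    def __hash__(self):
        return str.__hash__(self)

class ScrambledHashStr(str):
    # Scrambles the regular string hash (multiplicative hashing with
    # Knuth's constant) so consecutive keys no longer land in nearby buckets.
    def __hash__(self):
        return int((str.__hash__(self) * 2654435761) & 0x7FFFFFFF)

plain = [PlainHashStr(i) for i in range(1000000)]
scrambled = [ScrambledHashStr(i) for i in range(1000000)]

print "standard hash: ", timeit.timeit(lambda: dict((k, k) for k in plain), number=1)
print "scrambled hash:", timeit.timeit(lambda: dict((k, k) for k in scrambled), number=1)

Both classes pay the cost of a Python-level __hash__, so the absolute times are much slower than with plain str keys; only the relative difference between the local and the scattered hash is of interest.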

Checking the source code of the Python dict, you can see that consecutive strings or ints give fewer collisions. This, combined with @skishore's comment about better cache locality, could be the answer. Quoting the comments in the dict implementation (Objects/dictobject.c in CPython):

Major subtleties ahead: Most hash schemes depend on having a "good" hash function, in the sense of simulating randomness. Python doesn't: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad! To the contrary, in a table of size 2**i, taking the low-order i bits as the initial table index is extremely fast, and there are no collisions at all for dicts indexed by a contiguous range of ints. The same is approximately true when keys are "consecutive" strings. So this gives better-than-random behavior in common cases, and that's very desirable.
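As a rough illustration of the collision claim, the sketch below simulates only the dict's initial probe, hash(key) & (table_size - 1), and counts how many keys land on an already occupied slot. The table size of 2**21 and the helper name are assumptions for illustration; a real million-entry dict resizes as it grows and also walks a probe sequence on collision:

import random

def count_first_probe_collisions(hashes, table_size=2**21):
    # Only models the first slot the dict would try (hash & mask),
    # not the full probe sequence, so the counts are indicative only.
    mask = table_size - 1
    used = set()
    collisions = 0
    for h in hashes:
        slot = h & mask
        if slot in used:
            collisions += 1
        else:
            used.add(slot)
    return collisions

N = 1000000
print "contiguous ints:    ", count_first_probe_collisions(hash(i) for i in range(N))
print "consecutive strings:", count_first_probe_collisions(hash(str(i)) for i in range(N))
print "random hashes:      ", count_first_probe_collisions(random.getrandbits(30) for _ in range(N))

The contiguous ints cannot collide on the first probe at all (hash(i) == i and every value fits under the mask); the consecutive strings should come out much closer to that than to the purely random hashes, which collide heavily just from birthday effects.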

barracel