I've been creating huge dicts (millions of entries) and I've noticed that if I create them with the keys in order it is much faster.
I imagine that it has something to do with collisions with the hash function, but can someone explain why is it happening and if it is consistent among versions of python?
Here you have an artificial example:
import timeit
import random
def get_test_data(num, size):
olist, ulist = [], []
for _ in range(num):
otest = [str(i) for i in range(size)]
utest = list(otest)
random.shuffle(utest)
olist.append(otest)
ulist.append(utest)
return olist, ulist
NUM_TESTS = 20
# Precalculate the test data so we only measure dict creation time
ordered, unordered = get_test_data(NUM_TESTS, 1000000)
def test_ordered():
dict((k, k) for k in ordered.pop())
def test_unordered():
dict((k, k) for k in unordered.pop())
print "unordered: ",
print timeit.timeit("test_unordered()",
setup="from __main__ import test_unordered, test_ordered",
number=NUM_TESTS)
print "ordered: ",
print timeit.timeit("test_ordered()",
setup="from __main__ import test_unordered, test_ordered",
number=NUM_TESTS)
The output in my machine consistently is:
(X)$ python /tmp/test.py
unordered: 8.60760807991
ordered: 5.1214389801
I'm using Python 2.7.3 in ubuntu precise x86_64