42

I was using a dictionary as a lookup table, but I started to wonder whether a list would be better for my application -- the number of entries in my lookup table wasn't that big. I know lists use C arrays under the hood, which made me conclude that lookup in a list with just a few items would be faster than in a dictionary (accessing a few elements in an array is faster than computing a hash).

I decided to profile the alternatives, but the results surprised me. List lookup was faster only with a single element! See the following figure (log-log plot):

list vs dict lookup time

So here comes the question: Why do list lookups perform so poorly? What am I missing?

As a side question, something else that caught my attention was a little "discontinuity" in the dict lookup time after approximately 1000 entries. I plotted the dict lookup time alone to show it.

dict lookup time

p.s.1 I know about O(n) vs. O(1) amortized time for arrays and hash tables, but it is usually the case that for a small number of elements, iterating over an array is better than using a hash table.

p.s.2 Here is the code I used to compare the dict and list lookup times:

import timeit

lengths = [2 ** i for i in xrange(15)]

list_time = []
dict_time = []
for l in lengths:
    list_time.append(timeit.timeit('%i in d' % (l/2), 'd=range(%i)' % l))
    dict_time.append(timeit.timeit('%i in d' % (l/2),
                                   'd=dict.fromkeys(range(%i))' % l))
    print l, list_time[-1], dict_time[-1]

p.s.3 Using Python 2.7.13
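For readers on Python 3, an equivalent of the benchmark above might look like this (`range` no longer returns a list, so it is materialized explicitly; the reduced iteration count is my own choice so a quick run finishes in seconds):

```python
import timeit

lengths = [2 ** i for i in range(15)]

list_time = []
dict_time = []
for n in lengths:
    # number is reduced from timeit's default (1,000,000) so a quick
    # run finishes fast; relative timings stay comparable
    list_time.append(timeit.timeit('%i in d' % (n // 2),
                                   'd = list(range(%i))' % n,
                                   number=1000))
    dict_time.append(timeit.timeit('%i in d' % (n // 2),
                                   'd = dict.fromkeys(range(%i))' % n,
                                   number=1000))
    print(n, list_time[-1], dict_time[-1])
```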

Raymond Hettinger
hugos
    What's with the `l-1/2`s? That's just going to evaluate to `l`, which is never in your list or dict. – user2357112 Apr 28 '17 at 23:36
  • @user2357112 Oops, that wasn't in the original code. Thank you for pointing that out, I'll fix it. – hugos Apr 28 '17 at 23:38
  • Wild guess - equality comparison for a list means a `__eq__` lookup on each item... which itself is a `dict` access. – tdelaney Apr 28 '17 at 23:38
  • @tdelaney: No, for an `__eq__` implemented in C, the equality comparison won't actually go through the Python method resolution process, and no dict lookups will be involved. It'll go straight to the C-level `tp_richcompare` or `tp_compare` pointer. – user2357112 Apr 28 '17 at 23:40
  • @bipll this is true for Python 3, I'm using Python 2. I'll add a note about it in the question. – hugos Apr 28 '17 at 23:40
  • Is this python 3? `d=range(%i)` doesn't actually create a list. – tdelaney Apr 28 '17 at 23:41
  • [Can't reproduce.](http://ideone.com/djoTD4) On my tests (Python 2.7.13, like you), I get a faster time for the 1-element list. – user2357112 Apr 28 '17 at 23:42
  • You're measuring a very constrained, highly regular dataset of small consecutive integers, it seems. That probably doesn't tell you anything particularly useful about the average and typical behaviour of these structures. – pvg Apr 28 '17 at 23:44
  • @tdelaney It's Python2 – hugos Apr 28 '17 at 23:44
  • @user2357112 That's weird, I ran it like 5 times and all of them came like this. – hugos Apr 28 '17 at 23:45
  • @HugoSadok: Maybe you actually did have that `-1` in there in your tests. – user2357112 Apr 28 '17 at 23:46
  • Why don't you randomize your data and try with some other types of objects? Your dict lookup is more or less array index lookup, for what you've written. Of course it's going to be snappier than list traversal. – pvg Apr 28 '17 at 23:46
  • @user2357112 Maybe.. I'm running again – hugos Apr 28 '17 at 23:47
  • @user2357112 Yep, it performs better with a single element. It makes more sense now, but that is still a very poor performance for 2 elements... – hugos Apr 28 '17 at 23:49
  • I think it would be better to just ask one question. The list-lookup vs. dict-lookup times and the "jump" in the dict-lookup are due to different effects so the question is a bit "broad". Not a dealbreaker but an answer would have to answer both which could discourage answerers. – MSeifert Apr 28 '17 at 23:50
  • @user2357112 Fixed it – hugos Apr 29 '17 at 00:03
  • @pvg Does it really matter? I mean, the hash function will be applied anyway right? It shouldn't matter if it's sequential or not, even collision probability shouldn't change. – hugos Apr 29 '17 at 00:06
  • @HugoSadok: Due to implementation details of the dict implementation, the collision probability is 0 in all your tests. – user2357112 Apr 29 '17 at 00:06
  • @HugoSadok I think it matters an awful lot. You're mostly benchmarking 'speed of array index lookup' vs 'traversal of half a list' and getting the kind of results one might expect for that. Maybe that's what you wanted to test but it's not necessarily 'dicts are always faster than lists'. If you think the results of your benchmark are weird, one of the first things to check is whether your benchmark itself is weird or representative of what you're trying to measure. – pvg Apr 29 '17 at 00:18
  • @pvg from user2357112 answer we see it **really** does matter. Thank you for helping me find the answer. I'll test for other types and random data soon. – hugos Apr 29 '17 at 00:55

3 Answers

50

I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).

Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.

Your list lookups need to compute == a lot more than your dict lookups do.

Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but for all ints involved here, they just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)

Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
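A small sketch makes the "first probe" point concrete (this is an illustration, not CPython's actual code; the 8-slot table size is an assumption for the example): CPython's initial probe slot for a key is derived from `hash(key)` masked by `table_size - 1`, and since small ints hash to themselves, consecutive int keys land in consecutive slots.

```python
# Illustrative only: first-probe slot = hash(key) & (table_size - 1).
# Small ints hash to themselves, so consecutive int keys map to
# consecutive slots -- no collisions on the first probe.
table_size = 8  # assumed table size, for illustration
slots = [hash(k) & (table_size - 1) for k in range(5)]
print(slots)  # [0, 1, 2, 3, 4]
```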


Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
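You can observe the cache boundary directly. Building equal ints from strings sidesteps compile-time constant interning (a CPython implementation detail), so only the small-int cache can make them identical:

```python
# CPython keeps canonical int objects for -5..256 inclusive, so equal
# ints in that range are also the *same* object, and a membership test
# can stop at the cheap identity check.
a, b = int('256'), int('256')
print(a is b)   # True  -- both names point at the cached object

c, d = int('257'), int('257')
print(c is d)   # False -- two distinct objects; == must actually run
```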

user2357112
  • @MSeifert: The next smaller size was 512, and the key searched for in that case was 512/2=256. – user2357112 Apr 29 '17 at 00:17
  • @MSeifert: Also, note the logarithmic horizontal axis. – user2357112 Apr 29 '17 at 00:17
  • Ah, you're right. I could've sworn the spike was (significantly) below 1000 but looking at the code cleared that misunderstanding. Thanks – MSeifert Apr 29 '17 at 00:22
  • @user2357112 Thank you very much! I think this solves the "mystery". I didn't know integer hash worked like this in python -- this explains a lot. I'll test for other types soon. – hugos Apr 29 '17 at 00:53
24

The short answer is that lists use linear search and dicts use amortized O(1) search.

In addition, dict searches can skip an equality test either when 1) hash values don't match or 2) when there is an identity match. Lists only benefit from the identity-implies equality optimization.
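A hypothetical key type that counts its own equality tests makes the difference concrete (the class and the exact counts below are an illustration under CPython's current behavior, not internals):

```python
class Key:
    """Hashes like its wrapped int, but counts every __eq__ call."""
    eq_calls = 0

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return hash(self.n)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.n == other.n

keys = [Key(i) for i in range(1000)]
target = Key(500)          # equal to keys[500] but a distinct object

Key.eq_calls = 0
assert target in keys      # linear scan: identity shortcut never fires
list_eqs = Key.eq_calls    # 501 equality tests (indices 0..500)

d = dict.fromkeys(keys)
Key.eq_calls = 0
assert target in d         # hash narrows the search to one candidate
dict_eqs = Key.eq_calls    # 1 equality test
```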

Back in 2008, I gave a talk on this subject where you'll find all the details: https://www.youtube.com/watch?v=hYUsssClE94

Roughly the logic for searching lists is:

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

For dicts the logic is roughly:

h = hash(target)
for i in probe_sequence(h, len(table)):
    element = key_table[i]
    if element is UNUSED:
        raise KeyError(target)
    if element is target:
        # fast path for identity implies equality
        return value_table[i]
    if h != h_table[i]:
        # unequal hashes implies unequal keys
        continue
    if element == target:
        # slower check for actual equality
        return value_table[i]

Dictionary hash tables are typically between one-third and two-thirds full, so they tend to have few collisions (few trips around the loop shown above) regardless of size. Also, the hash value check prevents needless slow equality checks (the chance of a wasted equality check is about 1 in 2**64).

If your timing focuses on integers, there are some other effects at play as well. The hash of an int is the int itself, so hashing is very fast. Also, it means that if you're storing consecutive integers, there tend to be no collisions at all.
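The int-hashes-to-itself behavior is easy to check (the `-1` case is a CPython quirk: -1 is reserved as an internal error code, so it hashes to -2):

```python
# In CPython, small non-negative ints hash to their own value.
assert hash(5) == 5
assert hash(12345) == 12345
assert hash(-1) == -2   # -1 is reserved internally as an error code
```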

Raymond Hettinger
0

You say "accessing a few elements in an array is faster than computing a hash".

A simple hashing rule for strings might be just a sum of character codes (with a modulo at the end). This is a branchless operation that can compare favorably with character-by-character comparison, especially when a candidate string shares a long prefix with the target.
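A minimal sketch of such a rule (the function name and modulus here are made up for illustration; real string hashes also mix in character positions so that anagrams like "ab" and "ba" don't collide):

```python
def simple_hash(s, modulus=2**31):
    """Toy string hash: sum of character codes, reduced by a modulus."""
    return sum(ord(c) for c in s) % modulus

# "abc" -> (97 + 98 + 99) % 2**31 == 294
print(simple_hash("abc"))
```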