To answer your question about `dict` vs `list` you'd have to give exact information about the number of elements, the number of missing elements, etc., so that we could estimate the memory usage of the two data structures and try to predict and/or measure their performance.
In general a `dict` of N key-value pairs requires quite a bit more memory than a `list` with N values:
- The `dict` must keep track of the keys as well as the values
- The `dict` is never more than 2/3 full: once it reaches that threshold, the allocated table is roughly doubled (this is required to keep operations on the `dict` at O(1) amortized time)
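You can watch this resizing happen by tracking a `dict`'s reported size as keys are inserted. This is a rough sketch: the exact sizes and growth factors are CPython implementation details and vary between versions.

```python
import sys

# Watch the dict's allocation grow in steps as keys are inserted.
# Exact sizes and resize thresholds are CPython implementation
# details and differ between versions.
d = {}
sizes = []
for i in range(1000):
    d[i] = i
    sizes.append(sys.getsizeof(d))

# The size only ever grows while inserting, and it grows in jumps
# (whole-table resizes), not by one slot per key.
print(sorted(set(sizes)))
```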
However there is an alternative to these data structures which should provide very good performance: `blist`. The `blist` package provides an interface that matches that of `list`, only it is implemented using B-trees. It can efficiently handle sparse lists. Most operations take either O(1) or O(log n) time, so they are quite efficient.
For example you could first create a sparse `blist` doing:

```python
from blist import blist

seq = blist([None])
seq *= 2**30  # create a 2**30 element blist. Instantaneous!
```
And then you can set only the indices that actually have a value:

```python
for i, value in zip(indices, values):
    seq[i] = value
```
The full documentation is here.
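For comparison, the plain-`dict` way to store the same kind of sparse data is just as short; the `indices` and `values` below are made up to stand in for yours:

```python
# Hypothetical sparse data, standing in for the question's
# actual indices and values.
indices = [3, 100, 70_000, 2**20]
values = ["a", "b", "c", "d"]

# The dict equivalent of the sparse assignment above:
sparse = {i: v for i, v in zip(indices, values)}

# Missing indices simply aren't stored; supply a default on lookup.
print(sparse.get(5))  # prints None: index 5 was never set
```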
Note that `blist` provides other efficient operations, such as:

- Concatenating two `blist`s takes O(log n) time
- Taking an `[i:j]` slice takes O(log n) time
- Inserting an element at a given index takes O(log n) operations
- Popping an element (from any position, not just the end) takes O(log n) operations
Since you gave some numbers, here's how they compare to `dict`s:

```python
>>> from blist import blist
>>> b = blist([None])
>>> b *= 5000
>>> for i in range(100): b[i] = i
...
>>> b.__sizeof__()
2660
>>> d = dict()
>>> for i in range(100): d[i] = i
...
>>> d.__sizeof__()
6216
>>> b = blist([None])
>>> b *= 750
>>> for i in range(30): b[i] = i
...
>>> b.__sizeof__()
1580
>>> d = dict()
>>> for i in range(30): d[i] = i
...
>>> d.__sizeof__()
1608
```
In both cases the `blist` takes less memory (in the first case it takes about 1/3 of the memory of the equivalent `dict`). Note that the memory taken by a `blist` also depends on whether or not the indices are contiguous (contiguous is better). However even using random indices:
```python
>>> b = blist([None])
>>> b *= 5000
>>> import random
>>> for i in range(100): b[random.randint(0, 4999)] = i
...
>>> b.__sizeof__()
2916
```
it's still much better than the `dict`.
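If you want to sanity-check the dense-`list` side of this trade-off without installing `blist`, `sys.getsizeof` on a plain `list` and a `dict` shows why a sparse structure is attractive (exact numbers are CPython- and version-specific):

```python
import sys

# A dense 5000-slot list pays for every slot, filled or not...
lst = [None] * 5000
for i in range(100):
    lst[i] = i

# ...while a dict only pays for its 100 entries plus hash-table overhead.
d = {i: i for i in range(100)}

print(sys.getsizeof(lst), sys.getsizeof(d))
# The list is far larger here, which is exactly the waste that a
# structure like blist avoids while keeping list-style indexing.
```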
Even lookup times are better:
```python
In [1]: from blist import blist
   ...: import random

In [2]: b = blist([None])

In [3]: b *= 5000

In [4]: for i in range(100): b[random.randint(0, 4999)] = i

In [5]: %timeit b[0]
10000000 loops, best of 3: 50.7 ns per loop

In [6]: d = dict()

In [7]: for i in range(100): d[random.randint(0, 4999)] = i

In [10]: %timeit d[1024]  # 1024 is an existing key in this dictionary
10000000 loops, best of 3: 70.7 ns per loop

In [11]: %timeit b[1024]
10000000 loops, best of 3: 50.7 ns per loop
```
Note that a plain `list` takes about 47 ns to look up an index on my machine, so `blist` is really close to a `list` in terms of lookup performance on small lists like the ones you have.
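You can reproduce the `list`/`dict` half of these timings with the stdlib `timeit` module; absolute numbers are machine-dependent, so treat them as relative:

```python
import timeit

# Dense list with the first 100 slots filled, as in the examples above.
lst = [None] * 5000
for i in range(100):
    lst[i] = i
d = {i: i for i in range(100)}

# Time a single-index lookup on each structure.
list_time = timeit.timeit("lst[50]", globals={"lst": lst}, number=1_000_000)
dict_time = timeit.timeit("d[50]", globals={"d": d}, number=1_000_000)
print(f"list: {list_time:.3f}s  dict: {dict_time:.3f}s per 10**6 lookups")
```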