204

I have about 10 million values that I need to put in some type of look-up table, so I was wondering which would be more efficient: a list or a dict?

I know you can do something like this for both:

if something in dict_of_stuff:
    pass

and

if something in list_of_stuff:
    pass

My thought is the dict will be faster and more efficient.

Thanks for your help.

EDIT 1
A little more info on what I'm trying to do: Euler Problem 92. I'm making a look-up table to see if a value I've calculated has already been calculated.

EDIT 2
Efficiency for lookup.

EDIT 3
There are no values associated with the value... so would a set be better?

Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
Nope
  • 34,682
  • 42
  • 94
  • 119
  • 1
    Efficiency in terms of what? Insert? Lookup? Memory consumption? Are you checking for pure existence of a value, or is there any metadata associated with it? – mthurlin Feb 04 '09 at 23:36
  • 1
    As a side note, you don't need a 10 million list or dict for that specific problem but a much smaller one. – sfotiadis Aug 20 '14 at 14:16
  • http://www.jessicayung.com/how-python-implements-dictionaries/ – Nirmal Sep 04 '20 at 10:03

8 Answers

254

Speed

Lookups in lists are O(n), lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure. If you don't need to associate values, use sets.
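A minimal illustration of the difference (made-up values; the timings in the answers below quantify it properly):

# Membership tests: the list scans elements one by one, the set hashes once.
values_list = list(range(10**6))
values_set = set(values_list)

print(999999 in values_list)   # O(n): may walk the whole list
print(999999 in values_set)    # amortized O(1): a single hash lookup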

Memory

Both dictionaries and sets use hashing, and they use considerably more memory than is needed for the stored objects alone. According to A.M. Kuchling in Beautiful Code, the implementation tries to keep the hash table no more than 2/3 full, so you may waste quite a bit of memory.

If you do not add new entries on the fly (which you do, based on your updated question), it might be worthwhile to sort the list and use binary search. That is O(log n), though it is likely to be slower for strings and impossible for objects that do not have a natural ordering.
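A rough sketch of that sorted-list alternative using the standard-library bisect module (illustration only, not code from the question):

import bisect

values = sorted([17, 3, 42, 8, 23])        # one-off O(n log n) sort

def contains(sorted_values, x):
    # O(log n) membership test on an already-sorted list.
    i = bisect.bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x

print(contains(values, 23))   # True
print(contains(values, 5))    # False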

Torsten Marek
  • 83,780
  • 21
  • 91
  • 98
  • 6
    Yes, but it's a one-off operation if the contents never change. Binary search is O(log n). – Torsten Marek Feb 04 '09 at 23:54
  • OTOH, 10 million integers will take, what, 40 million bytes? If the hash is 2/3 full, that goes to 60 million, and there'll be overhead (anyone know how much?), but the whole thing should fit in a few hundred megs of memory. It might've been a problem ten years ago, but it's not really now. – John Fouhy Feb 05 '09 at 01:26
  • 1
    @John Fouhy: the ints are not stored in the hash table, only pointers, i.e. you have 40M for the ints (well, not really when a lot of them are small) and 60M for the hash table. I agree that it's not that much of a problem nowadays, still it's worthwhile to keep in mind. – Torsten Marek Feb 05 '09 at 11:04
  • 2
    This is an old question, but I think _amortized O(1)_ may not hold true for very large sets/dicts. The worst case scenario according to http://wiki.python.org/moin/TimeComplexity is O(n). I guess it depends on the internal hashing implementation at what point the average time diverges from O(1) and starts converging on O(n). You can help the lookup performance by compartmentalizing the global sets into smaller sections based on some _easily discernible_ attribute (like the value of the first digit, then the second, third, etc., for as long as you need to get optimal set size). – Nisan.H Sep 21 '12 at 17:51
  • 3
    @TorstenMarek This confuses me. From [this page](https://wiki.python.org/moin/TimeComplexity), list lookup is O(1) and dict lookup is O(n), which is the opposite of what you said. Am I misunderstanding? – temporary_user_name Nov 10 '13 at 22:03
  • 3
    @Aerovistae I think you misread the info on that page. Under list, I see O(n) for "x in s" (lookup). It also shows set and dict lookup as O(1) average case. – Dennis Mar 06 '14 at 03:23
  • 1
    The dict implementation has been changed for Python 3.6. The "2/3 full" thing isn't exactly accurate anymore (though it sort of is). – Rick Jun 21 '17 at 09:49
59

A dict is a hash table, so finding keys is really fast. So between a dict and a list, the dict would be faster. But if you don't have values to associate, it is even better to use a set. It is a hash table, without the "table" part.


EDIT: for your new question, YES, a set would be better. Just create 2 sets, one for chains that end in 1 and the other for chains that end in 89. I have successfully solved this problem using sets.
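A minimal sketch of that two-set idea (my own illustration of the approach described above, not the answer author's actual code):

def digit_square_sum(n):
    # Sum of the squares of the digits of n.
    total = 0
    while n:
        n, d = divmod(n, 10)
        total += d * d
    return total

ends_in_1 = {1}
ends_in_89 = {89}

def arrives_at_89(n):
    # Follow the chain, caching every intermediate value in the right set.
    seen = []
    while n not in ends_in_1 and n not in ends_in_89:
        seen.append(n)
        n = digit_square_sum(n)
    target = ends_in_89 if n in ends_in_89 else ends_in_1
    target.update(seen)
    return target is ends_in_89

# Counts the starting numbers below ten million whose chain reaches 89
# (pure Python, so this takes a while to run).
print(sum(arrives_at_89(n) for n in range(1, 10**7)))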

nosklo
  • 217,122
  • 57
  • 293
  • 297
  • You don't even need sets. Just use a dynamic programming approach recording for each number if the chain ends at 1, 89, or unknown. – qwr Aug 10 '23 at 04:54
47

set() is exactly what you want. O(1) lookups, and smaller than a dict.
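A rough way to sanity-check the size difference (sys.getsizeof reports only the container's own overhead, not the stored objects, and the exact numbers depend on the CPython version):

import sys

keys = range(10**6)
print(sys.getsizeof(set(keys)))            # set slots store a hash and a key
print(sys.getsizeof(dict.fromkeys(keys)))  # dict entries also store a value, so typically larger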

David Metcalfe
  • 2,237
  • 1
  • 31
  • 44
recursive
  • 83,943
  • 34
  • 151
  • 241
39

I did some benchmarking and it turns out that dict is faster than both list and set for large data sets, running python 2.7.3 on an i7 CPU on linux:

  • python -mtimeit -s 'd=range(10**7)' '5*10**6 in d'

    10 loops, best of 3: 64.2 msec per loop

  • python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'

    10000000 loops, best of 3: 0.0759 usec per loop

  • python -mtimeit -s 'from sets import Set; d=Set(range(10**7))' '5*10**6 in d'

    1000000 loops, best of 3: 0.262 usec per loop

As you can see, dict is considerably faster than list and about 3 times faster than set. In some applications you might still want to choose set for the beauty of it, though. And if the data sets are really small (< 1000 elements) lists perform pretty well.

Thiem Nguyen
  • 6,345
  • 7
  • 30
  • 50
EriF89
  • 730
  • 7
  • 12
  • Shouldn't it be exactly the opposite? List: 10 * 64.2 * 1000 = 642000 usec, dict: 10000000 * 0.0759 = 759000 usec and set: 1000000 * 0.262 = 262000 usec ... so sets are the fastest, followed by the list and with dict as last on your example. Or am I missing something? – andzep Nov 09 '12 at 12:15
  • 1
    ... but the question for me here is: what are these times actually measuring? Not the access time for a given list, dict or set, but much more, the time and loops to _create_ the list, dict, set and finally to find and access one value. So, does this have to do with the question at all? ... It's interesting though... – andzep Nov 09 '12 at 12:25
  • 11
    @andzep, you are mistaken, the `-s` option is to setup the `timeit` environment, i.e. it does not count in the total time. The `-s` option is run only once. On Python 3.3, I get these results: gen (range) -> 0.229 usec, list -> 157 msec, dict -> 0.0806 usec, set -> 0.0807 usec. Set and dict performance is the same. Dict however takes a bit longer to initialize than set (total time 13.580s v. 11.803s) – sleblanc Jun 17 '13 at 03:48
  • 4
    why not use builtin set? I actually get much worse results with sets.Set() than with builtin set() – Thomas Guyot-Sionnest Oct 27 '16 at 14:46
  • 4
    @ThomasGuyot-Sionnest The built in set was introduced in python 2.4 so I'm not sure why I didn't use it in my proposed solution. I get good performance with `python -mtimeit -s "d=set(range(10**7))" "5*10**6 in d"` using Python 3.6.0 (10000000 loops, best of 3: 0.0608 usec per loop), roughly the same as the dict benchmark so thank you for your comment. – EriF89 Jan 25 '17 at 15:17
  • 3
    pretty sure range produces a range object.. not a list – 0TTT0 Nov 05 '17 at 17:26
  • On Python 3.7.7 (3.30GHz Xeon processor), the timings are: `69.3 nsec`, `30 nsec` and `30.7 nsec` respectively. – Nirmal Sep 04 '20 at 09:33
  • @0TTT0: That was true in 2017, but not in 2009. – recursive Jun 21 '22 at 17:22
11

You want a dict.

For (unsorted) lists in Python, the "in" operation requires O(n) time---not good when you have a large amount of data. A dict, on the other hand, is a hash table, so you can expect O(1) lookup time.

As others have noted, you might choose a set (which behaves like a dict that stores only keys) instead, if you only have keys rather than key/value pairs.

Related:

  • Python wiki: information on the time complexity of Python container operations.
  • SO: Python container operation time and memory complexities
Community
  • 1
  • 1
zweiterlinde
  • 14,557
  • 2
  • 27
  • 32
  • 1
    Even for sorted lists, "in" is O(n). –  Feb 05 '09 at 03:39
  • 2
    For a linked list, yes---but "lists" in Python are what most people would call vectors, which provide indexed access in O(1) and a find operation in O(log n), when sorted. – zweiterlinde Feb 05 '09 at 14:49
  • Are you saying that the `in` operator applied to a sorted list performs better than when applied to an unsorted one (for a search of a random value)? (I don't think whether they are implemented internally as vectors or as nodes in a linked-list is relevant.) – martineau Nov 27 '10 at 12:59
8

As a new set of tests to show @EriF89 is still right after all these years:

$ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.84 msec per loop
$ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 573 msec per loop
$ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 587 msec per loop
$ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.88 msec per loop

Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. For a lookup table, the tuple fared no better.

Both the dict and set performed very well. This brings up an interesting point tying into @SilentGhost's answer about uniqueness: if the OP has 10M values in a data set, and it's unknown whether there are duplicates in them, then it would be worth keeping a set/dict of its unique elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points contain only 10 unique values, which is a much smaller space to search!

SilentGhost's mistake about dicts is actually illuminating because one could use a dict to correlate duplicated data (in values) into a nonduplicated set (keys), and thus keep one data object to hold all data, yet still be fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> l=[1,2,3,1,2,1,4]
>>> for i, e in enumerate(l):
...     d[e].append(i)
>>> d
defaultdict(<class 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})

With this dict, one can know (as shown below):

  1. Whether a value was in the original dataset (i.e. 2 in d returns True)
  2. Where the value occurred in the original dataset (i.e. d[2] returns the list of indices where it was found in the original data list: [1, 4])
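For example, continuing the session above:

>>> 2 in d      # the value 2 appeared in the original list
True
>>> d[2]        # at these indices
[1, 4]
>>> 7 in d      # a value that never occurred
False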
hamx0r
  • 4,081
  • 1
  • 33
  • 46
  • For your last paragraph, while it makes sense reading it, it would be nice (and probably easier to grasp) to see the actual code you are trying to explain. – kaiser Jun 29 '18 at 19:31
4

If the data are unique, set() will be the most efficient, but of the two you asked about - dict (which also requires uniqueness, oops :)

SilentGhost
  • 307,395
  • 66
  • 306
  • 293
0

You don't actually need to store 10 million values in the table, so it's not a big deal either way.

Hint: think about how large your result can be after the first sum of squares operation. The largest possible result will be much smaller than 10 million...
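A quick sanity check of that bound (assuming, as in the problem, starting numbers below 10**7, i.e. at most seven digits):

# Each digit contributes at most 9**2 = 81 to the first sum of squares,
# so a seven-digit starting number can map to at most:
print(7 * 9**2)   # 567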

Kiv
  • 31,940
  • 6
  • 44
  • 59