Fastest way to search a list in python

Question

When you do something like "test" in a where a is a list does python do a sequential search on the list or does it create a hash table representation to optimize the lookup? In the application I need this for I'll be doing a lot of lookups on the list so would it be best to do something like b = set(a) and then "test" in b? Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.

Creating a hash would require examining every element in order to build the hash, so that can't be faster unless for some reason comparing two values for equality is very slow. Of course, building such a hash is faster if it will be reused. — Karl Knechtel, Mar 06 '23 at 07:31

orlp · Accepted Answer · 2011-05-13T14:56:41.047

89

Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.

Don't use a list, use a set() instead. It has exactly the properties you want, including a blazing fast in test.

I've seen speedups of 20x and higher in places (mostly heavy number crunching) where one list was changed for a set.

edited May 13 '11 at 14:56

answered May 13 '11 at 14:47

orlp

112,504
36
218
315

1

@blcArmadillo: `set's are the way to go since you don't have duplicate data and don't care about order -- plus you can always enumerate the members of the set or quickly turn it into a list if needed. – martineau May 13 '11 at 15:14
2

I used this and it spend things up immensely. Thank you. – Josh Usre Aug 27 '15 at 22:11
2

Wow, I had a dumb script brute forcing through two files to find similar lines, and this just cut the time down from ~20 min to under 1. Thanks! – Parker Oct 29 '15 at 22:18
2

With a very large list and almost ~2 million checks the compute time came down from 3 hours to < 1 min !!!! – vin Apr 30 '17 at 06:58
I think the user should be careful here because using `set(ORIGINAL_LIST)` will raise an exception if the list contains non-hashable types like dictionaries or lists. – Ebram Shehata Feb 03 '22 at 13:35
woah, sets are ridiculously fast. My lists were already unique but still using sets reduced ~30 seconds to ~1 second! – Hammad Mar 06 '22 at 07:30
What if you need order in the original list preserved? – Wojciech Jakubas Oct 10 '22 at 12:01
@WojciechJakubas If you repeatedly need to search, store your data twice, once in the set and once in the list. – orlp Oct 11 '22 at 14:34
Just want to add another anecdote to the "sets are fast" record. I had an AdventOfCode solution take 27 minutes to complete because I was doing a lot of in tests on a list. Changed the list to a set and it completed in 5 seconds. Each iteration added more to the list so it got progressively slower over time with the list but changing a few lines to use a set instead and blazing fast (relatively). – Steve E Dec 14 '22 at 21:57

score 12 · Answer 2 · answered May 13 '11 at 14:48

12

"test" in a with a list a will do a linear search. Setting up a hash table on the fly would be much more expensive than a linear search. "test" in b on the other hand will do an amoirtised O(1) hash look-up.

In the case you describe, there doesn't seem to be a reason to use a list over a set.

answered May 13 '11 at 14:48

Sven Marnach

574,206
118
941
841

This is only true if there are many many lookups done on b after it's constructed. If b needs to be (re)constructed every time a lookup is performed, then `"test" in b` will be slower, since the construction of the set would not be linear. – Jamie Wong May 13 '11 at 14:55
1

@Jamie: From the OP: "In the application I need this for I'll be doing a lot of lookups on the list [...]". Seems like there are many lookups. – Sven Marnach May 13 '11 at 14:58
I agree it's the right solution - just trying to make it clear. – Jamie Wong May 13 '11 at 15:01

score 3 · Answer 3 · answered May 13 '11 at 14:49

I think it would be better to go with the set implementation. I know for a fact that sets have O(1) lookup time. I think lists take O(n) lookup time. But even if lists are also O(1) lookup, you lose nothing with switching to sets.

Further, sets don't allow duplicate values. This will make your program slightly more memory efficient as well

score 2 · Answer 4 · edited May 23 '17 at 11:55

List and tuples seems to have the same time, and using "in" is slow for large data:

>>> t = list(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.66235494614
>>> t = tuple(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.6594209671

Here is much better solution: Most efficient way for a lookup/search in a huge list (python)

It's super fast:

>>> from bisect import bisect_left
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [t[bisect_left(t,b)]==b for b in range(100234,101234)];print(time.time()-a)
0.0054759979248

List must be sorted first. – Grzegorz Świerad Sep 14 '16 at 12:33 — Grzegorz Świerad, Sep 14 '16 at 12:33

Fastest way to search a list in python

4 Answers4

Linked

Related