Test if lists share any items in python

Question

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
   ....:     for i in a:
   ....:         if i in b:
   ....:             return True
   ....:     return False
   ....: 

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
   ....:     return len(set(a).intersection(set(b))) > 0
   ....:

For a numpy solution see [python - Check if 2 arrays have at least one element in common? - Stack Overflow](https://stackoverflow.com/questions/32969491/check-if-2-arrays-have-at-least-one-element-in-common) (although it's not clear whether it's actually faster) — user202729, Dec 05 '21 at 17:45

score 479 · Accepted Answer · edited May 04 '22 at 12:09

Short answer: use not set(a).isdisjoint(b), it's generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But

it must first create sets out of the lists, which can take a non-negligible amount of time, and
it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

not set(a).isdisjoint(b)

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here's a graph of the execution time for this example in function of list size:

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

In summary:

If the lists are very small (< 10 elements), not set(a).isdisjoint(b) is always the fastest.
If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
Test the set intersection with not set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

That's some useful data there, shows that big-O analysis isn't the be all and end all of reasoning about running time. — Steve Allison, Jul 18 '13 at 23:21
what about the worst case scenario? `any` exits at the first non-False value. Using a list where the only matching value is at the end, we get this: `timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,-0,-1)]", number=1000) 13.739536046981812` `timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,-0,-1)]", number=1000) 0.08102107048034668` ... and it's with 1000 iterations only. — RobM, Sep 28 '15 at 08:12
Thanks @RobM for the information. I've updated my answer to reflect this and to take into account the other techniques proposed in this thread. — Soravux, Jan 31 '16 at 16:06
Just as a heads-up for anyone who got stuck with this answer like I just did : set() cannot work with lists whose elements are other lists (eg [ [1,2], [3,4]). Doing so yields TypeError: unhashable type: 'list'. — Soltius, Mar 18 '22 at 13:29

score 28 · Answer 2 · answered Jul 03 '10 at 02:21

28

def lists_overlap3(a, b):
    return bool(set(a) & set(b))

Note: the above assumes that you want a boolean as the answer. If all you need is an expression to use in an if statement, just use if set(a) & set(b):

answered Jul 03 '10 at 02:21

John Machin

81,303
11
141
189

6

This is worst-case O(n + m). However, the down side is that it creates a new set, and doesn't bail out when a common element is found early. – Matthew Flaschen Jul 03 '10 at 02:45
1

I curious as to why this is `O(n + m)`. My guess would be that sets are implemented using hash-tables, and thus the `in` operator can work in `O(1)` time (except in degenerate cases). Is this correct? If so, given that hash tables have worst case lookup performance of `O(n)`, does this mean that in the unlike worse case it will have `O(n * m)` performance? – fmark Jul 03 '10 at 03:24
1

@fmark: Theoretically, you are right. Practically, nobody cares; read the comments in Objects/dictobject.c in the CPython source (sets are just dicts with only keys, no values) and see if you can generate a list of keys that will cause O(n) lookup performance. – John Machin Jul 03 '10 at 03:49
Okay, thanks for clarifying, I was wondering if there was some magic going on :) . While I agree that practically I don't need to care, it is trivial to generate a list of keys that will cause `O(n)` lookup performance ;), see http://pastebin.com/Kn3kAW7u Just for lafs. – fmark Jul 03 '10 at 08:06
@fmark: No magic, just engineering. Yeah it's trivial to define a hash function that returns a constant. Ha ha chuckle chuckle. Now try again, without sodding about with a non-standard hash function. – John Machin Jul 03 '10 at 08:19
2

Yeah, I know. Plus I just read the source you pointed me to, which documents even more magic in the case of non-randomish hash functions (such as the built-in one). I assumed it required randomishness, like the Java one, which results in monstrousities like this http://stackoverflow.com/questions/2634690/good-hash-function-for-a-2d-index . I need to keep reminding myself that **Python Is Not Java** (thank deity!). – fmark Jul 03 '10 at 08:28

Matthew Flaschen · Answer 3 · 2010-07-03T23:35:24.963

11

def lists_overlap(a, b):
  sb = set(b)
  return any(el in sb for el in a)

This is asymptotically optimal (worst case O(n + m)), and might be better than the intersection approach due to any's short-circuiting.

E.g.:

lists_overlap([3,4,5], [1,2,3])

will return True as soon as it gets to 3 in sb

EDIT: Another variation (with thanks to Dave Kirby):

def lists_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

This relies on imap's iterator, which is implemented in C, rather than a generator comprehension. It also uses sb.__contains__ as the mapping function. I don't know how much performance difference this makes. It will still short-circuit.

edited Jul 03 '10 at 23:35

answered Jul 03 '10 at 02:40

Matthew Flaschen

278,309
50
514
539

2

Loops in intersection approach are all in C code; there's one loop in your approach that includes Python code. The big unknown is whether an empty intersection is likely or unlikely. – John Machin Jul 03 '10 at 02:55
2

You could also use `any(itertools.imap(sb.__contains__, a))` which should be faster still since it avoids using a lambda function. – Dave Kirby Jul 03 '10 at 10:53
Thanks, @Dave. :) I agree removing the lambda is a win. – Matthew Flaschen Jul 03 '10 at 23:31

score 6 · Answer 4 · answered Jul 03 '10 at 02:23

6

You could also use any with list comprehension:

any([item in a for item in b])

answered Jul 03 '10 at 02:23

Ioachim

76
2

6

You could, but the time is O(n * m) whereas the time for the set intersection approach is O(n + m). You could also do it WITHOUT list comprehension (lose the `[]`) and it would run faster and use less memory, but the time would still be O(n * m). – John Machin Jul 03 '10 at 02:29
1

While your big O analysis is true, I suspect that for small values of n and m the time it takes to build the underlying hashtables will come into play. Big O ignores the time it takes to compute the hashes. – Anthony Conyers Jul 03 '10 at 02:49
2

Building a "hashtable" is amortised O(n). – John Machin Jul 03 '10 at 02:59
1

I get that but the constant you're throwing away is pretty big. It doesn't matter for large values of n, but it does for small ones. – Anthony Conyers Jul 03 '10 at 13:36

score 4 · Answer 5 · answered Nov 13 '12 at 10:49

4

In python 2.6 or later you can do:

return not frozenset(a).isdisjoint(frozenset(b))

answered Nov 13 '12 at 10:49

Toughy

767
6
5

1

It seems like one doesn't have to supply a set or frozenset as the first argument. I tried with a string and it worked (i.e.: any iterable will do). – Aktau Feb 11 '15 at 14:30

Anthony Conyers · Answer 6 · 2010-07-03T13:47:19.727

2

You can use the any built in function /w a generator expression:

def list_overlap(a,b): 
     return any(i for i in a if i in b)

As John and Lie have pointed out this gives incorrect results when for every i shared by the two lists bool(i) == False. It should be:

return any(i in b for i in a)

edited Jul 03 '10 at 13:47

answered Jul 03 '10 at 02:28

Anthony Conyers

351
1
5

1

Amplifying Lie Ryan's comment: will give wrong result for any item x that's in the intersection where `bool(x)` is False. In Lie Ryan's example, x is 0. Only fix is `any(True for i in a if i in b)` which is better written as the already seen `any(i in b for i in a)`. – John Machin Jul 03 '10 at 04:00
1

Correction: will give wrong result when *all* items `x` in the intersection are such that `bool(x)` is `False`. – John Machin Jul 03 '10 at 04:07

binoche9 · Answer 7 · 2014-10-08T22:09:27.720

This question is pretty old, but I noticed that while people were arguing sets vs. lists, that no one thought of using them together. Following Soravux's example,

Worst case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
100.91506409645081
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
19.746716022491455
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)
0.092626094818115234

And the best case for lists:

>>> timeit('bool(set(a) & set(b))',  setup="a=list(range(10000)); b=list(range(10000))", number=100000)
154.69790101051331
>>> timeit('any(i in a for i in b)', setup="a=list(range(10000)); b=list(range(10000))", number=100000)
0.082653045654296875
>>> timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=list(range(10000))", number=100000)
0.08434605598449707

So even faster than iterating through two lists is iterating though a list to see if it's in a set, which makes sense since checking if a number is in a set takes constant time while checking by iterating through a list takes time proportional to the length of the list.

Thus, my conclusion is that iterate through a list, and check if it's in a set.

Using the `isdisjoint()` method on a (frozen)set as indicated by @Toughy is even better: `timeit('any(i in a for i in b)', setup="a= set(range(10000)); b=[x+9999 for x in range(10000)]", number=100000)` => 0.00913715362548828 — Aktau, Feb 11 '15 at 14:34

score 1 · Answer 8 · answered Mar 10 '15 at 16:11

1

if you don't care what the overlapping element might be, you can simply check the len of the combined list vs. the lists combined as a set. If there are overlapping elements, the set will be shorter:

len(set(a+b+c))==len(a+b+c) returns True, if there is no overlap.

answered Mar 10 '15 at 16:11

domoarigato

2,802
4
24
41

If the first value overlaps it will still convert the entire list to a set, no matter how large. – Peter Wood Oct 16 '15 at 13:27
what if list has duplicate numbers? – Anonymous Mar 31 '23 at 08:37

score 1 · Answer 9 · answered Nov 09 '16 at 18:08

I'll throw another one in with a functional programming style:

any(map(lambda x: x in a, b))

Explanation:

map(lambda x: x in a, b)

returns a list of booleans where elements of b are found in a. That list is then passed to any, which simply returns True if any elements are True.

Test if lists share any items in python

9 Answers9

Linked

Related