46

I want to know which elements of list_1 are in list_2. I need the output as an ordered list of booleans. But I want to avoid for loops, because both lists have over 2 million elements.

This is what I have and it works, but it's too slow:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []
for i in list_1:
   booleans.append(i in list_2)

# booleans = [False, False, True, True, False, False]

I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.

How can I make my code more efficient?

herdek550
  • This might help [Fastest way to check if a value exists in a list](https://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exists-in-a-list) – Tharu Apr 24 '22 at 16:50
  • I'll admit I'm not familiar enough with vectorizing, but it seems that if you're specifying that the output is an ordered list of booleans, you're unnecessarily slowing things down. How are you using this output? – Teepeemm Apr 25 '22 at 16:41
  • `numpy.sum()` uses vector operations, but I don't think `sum()` does – Pranav Hosangadi Apr 25 '22 at 19:02
  • Here's an old question on *unordered* list intersection. https://stackoverflow.com/q/3697432/4014959 My answer there has some timeit tests. – PM 2Ring Apr 27 '22 at 17:51

8 Answers

86

I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the numpy isin() approach.

Setup 1

import random

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]

Setup 2

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(10_001, 20_000) for i in range(100_000)]
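
The timing format below ("mean ± std. dev. of 7 runs") is what IPython's `%timeit` magic prints. As a minimal sketch, one such measurement could be reproduced with the standard-library `timeit` module (the wrapper function is illustrative and relies on the lists from the setups above; it is not part of the original benchmarks):

import timeit

def fastest_approach():
    # Cardstdani - approach 1, timed first below
    s = set(list_2)
    return [i in s for i in list_1]

# Average over 1,000 loops, matching the per-loop figures below.
n_loops = 1_000
total = timeit.timeit(fastest_approach, number=n_loops)
print(f"{total / n_loops * 1e3:.2f} ms per loop")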

Timings - ordered from fastest to slowest (setup 1).

Cardstdani - approach 1


I recommend converting Cardstdani's approach into a list comprehension (see this question for why list comprehensions are faster)

s = set(list_2)
booleans = [i in s for i in list_1]

# setup 1
6.01 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.19 ms ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

No list comprehension

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)

# setup 1
7.28 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
5.87 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Cardstdani - approach 2 (with an assist from Timus)


common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]

# setup 1
8.3 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
6.01 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Using the set intersection method

common = set(list_1).intersection(list_2)
booleans = [item in common for item in list_1]

# setup 1
10.1 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
4.82 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

numpy approach (crissal)


a1 = np.array(list_1)
a2 = np.array(list_2)

a = np.isin(a1, a2)

# setup 1
18.6 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2
18.2 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# setup 2 (assuming list_1, list_2 already numpy arrays)
10.3 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

list comprehension


l = [i in list_2 for i in list_1]

# setup 1
4.85 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.6 s ± 823 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sharim - approach 1


booleans = list(map(lambda e: e in list_2, list_1))

# setup 1
4.88 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using the `__contains__` method

booleans = list(map(list_2.__contains__, list_1))

# setup 1
4.87 s ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
48.2 s ± 486 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sharim - approach 2


set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))

# setup 1
5.46 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# setup 2
11.1 s ± 75.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Varying the length of the input


Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]

and varying n in [2 ** k for k in range(18)]:

[Plot: timings of each approach for n in [2 ** k for k in range(18)]]
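
The plots were presumably generated by sweeping n and timing each approach; a minimal sketch of such a sweep for one approach (the helper name is illustrative, not from the original post):

import random
import timeit

def set_comprehension(list_1, list_2):
    s = set(list_2)
    return [i in s for i in list_1]

for n in [2 ** k for k in range(18)]:
    list_1 = [random.randint(1, n) for i in range(n)]
    list_2 = [random.randint(1, n) for i in range(n)]
    # Best of 3 repeats of 10 loops each, reported per loop.
    t = min(timeit.repeat(lambda: set_comprehension(list_1, list_2),
                          number=10, repeat=3)) / 10
    print(f"n={n}: {t * 1e3:.3f} ms")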

Employing the following setup

import random 

list_1 = [random.randint(1, n ** 2) for i in range(n)]
list_2 = [random.randint(1, n ** 2) for i in range(n)]

and varying n in [2 ** k for k in range(18)], we obtain similar results:

[Plot: timings of each approach for n in [2 ** k for k in range(18)], values drawn from range(1, n ** 2)]

Employing the following setup

list_1 = list(range(n))
list_2 = list(range(n, 2 * n))

and varying n in [2 ** k for k in range(18)]:

[Plot: timings of each approach for n in [2 ** k for k in range(18)], disjoint lists]

Employing the following setup

import random 

list_1 = [random.randint(1, n) for i in range(10 * n)]
list_2 = [random.randint(1, n) for i in range(10 * n)]

and varying n in [2 ** k for k in range(18)]:

[Plot: timings of each approach for n in [2 ** k for k in range(18)], lists of length 10 * n]

OTheDev
  • I think @Cardstdani's approach should be modified to `common = set(list_1) & set(list_2); booleans = [item in common for item in list_1]`. – Timus Apr 24 '22 at 17:43
  • Let me add this in @Timus. – OTheDev Apr 24 '22 at 17:46
  • @SharimIqbal I don't think this is a good idea: You're modifying a whole set in each test, that takes a lot of time -- see the results – Timus Apr 24 '22 at 17:59
  • @SharimIqbal Unfortunately, that one is no faster than the approach listed at the bottom of my post. Do you want me to include it? – OTheDev Apr 24 '22 at 18:01
  • Thank you everyone! Happy to do it! If I get the time I will try to add a plot that shows the timings for each approach for varying input sizes. – OTheDev Apr 24 '22 at 18:10
  • @Cardstdani I hope you do not mind but I added a list comprehension version of yours and called it yours. It is a little faster for me. – OTheDev Apr 24 '22 at 18:17
  • Thanks @EricDuminil for the comments. The computer is currently working on the `random.randint(1, n**2)` case right now. I will also try to run the worst-case test you mentioned. – OTheDev Apr 25 '22 at 14:44
  • @EricDuminil I have updated my post to try to include some of the things you mentioned. Did I miss anything (I am currently redoing the graphs to include all of Sharim's solutions)? – OTheDev Apr 25 '22 at 19:52
  • You might be able to further optimize `set(list_1) & set(list_2)` to `set(list_1).intersection(list_2)`. Can't remember when `intersection()` was upgraded to accept any iterable rather than requiring another set, but ages ago. – Steve Jessop Apr 26 '22 at 03:53
  • And `lambda e:e in list_2` to `list_2.__contains__`. I realise you didn't write these solutions and aren't responsible for them, but since you're running the benchmarks you're probably also the person who potentially cares about such local optimisations, if indeed they do shave anything off :-) – Steve Jessop Apr 26 '22 at 03:57
  • @SteveJessop I think you are right regarding the minor gains that can be obtained using `intersection()`. Initial tests don't show much gains from using `__contains__` for larger inputs. Anyways, I am running your suggestions now. I am also going to take away all of the `iter()`s above as there is absolutely no need for them. Thanks for the suggestions! – OTheDev Apr 26 '22 at 15:47
  • FWIW my first instinct was Cardstdani set+list comprehension variant, and that's what I'd stick with unless I saw a big gain from some other suggestion. Just feels like the simplest way to express "line 1: we've got the wrong data structure for this task, so create the correct one", "line 2: do what we're actually here for". Done. – Steve Jessop Apr 26 '22 at 16:23
  • @EricDuminil Such ranges are **not** worst cases, making a set from them is [much faster](https://tio.run/##bVLRjoIwEHzvV@yLoTVoUM7kQuKXEEJQFm2OFtLWO83lvp1rC0W9k5dud2dnZtn2N3PuZPreq2FoVCdAVbK2Bxd9pwzo86VpWiS@dOAajyaUuNT2IKRT/MRl1cIeWq4NtQQnpNskKZMkYYxMFLWtB2ieFSFNQ5kRhRK/sC4fGvLV6gpNp@Bq5YKZ2jYfqwlwhSVstunb7g4LKgH2RPgC/0A7ReXMPwav/E4YRoisBGrHHQXlKIYo4HzsofeofCo@E7rU3z8RFeMCDBfI5wUo7LEyxGgn/u1cZJAXfi53caN5az/EpUp3H3ezSVhGwH5hJg9jPvWve0S6r9XGCuFn1foGNhdcWnBJR0O0rcShrjLQaGNtWAzyIg6o9ht27xlfD7Xmc0dWxGAYeRR3ZaxHazF84G1vsesTmsl7r7g0dEmjxW69bUBoiGAB1Lj9Ysr8IO6RwiyRZ2nhzDjvw/AL) than from variations. – Kelly Bundy May 06 '22 at 02:49
  • @EricDuminil Yes, for the non-set solutions it's a worst case. But for set solutions it's not, and to me it seems inappropriate to use inputs that are worst cases for some solutions and best(?) cases for other solutions. – Kelly Bundy May 06 '22 at 03:08
  • `[*map(set(list_2).__contains__, list_1)]` and `[*map(contains, repeat(set(list_2)), list_1)]` seem to be a bit faster than @Cardstdani's. For `n=50_000` and `[random.randint(1, n ** 2) for i in range(n)]` I get about 6.7 ms vs 7.1 ms: – Kelly Bundy May 06 '22 at 03:16
  • [Benchmark code and results](https://tio.run/##lVLLiuMwELzrK/oyjGSMsR0CSyBfYoJQ4vaMWEs2UucQlv32jNqyZ8zsHnZ10Ku7qkvVmh/0PvnDjzk8nz0OcDNB9@jJRpJjmnRTwrK26iQgjQhniLgGW7XcBaR78NBZsD4lDFOAZZsJLkIw808cx0fzd9KNoHBmljv2Suvb5MlYH7VeMY3aE7b/QLhRlOl6RkP7CmrPOtz9jd/XZa3lWqLcu8JpYXJgCQNN0xjBunkKtHLn4DRjMMQ25NimQKxnsg4tbadgfJ9AGXq1EW@0ARMmLUL4pOpY67quRZbLKjOu4sV6kkmvh6KAVn11IMXeUPr0uPze/4YJiuzIL7bmBN1lyeEDpy1@/RZ8pb9gx3prQi4U3@/DMKJcsvOH@YMkA3hQquasl9miau3YaNy1N6cl@XvHk/67u2I4N0p90mTjZFLfMeZSAimxL8th7LMobvTjnHKrN6RV/BzYnEK@vhyrdgAX4RVeQBIU0OAhe8X9gc8S3elwSWJ4n/6tNw61Vs/nBw). – Kelly Bundy May 06 '22 at 03:16
  • The numpy solution is interesting. Its times seem to grow less than the supposedly linear-time set-solutions. I wonder how it's implemented. – Kelly Bundy May 06 '22 at 03:42
  • @EricDuminil Ok yes, it achieves that goal of further separating O(n) and O(n²), and maybe O(6n) wouldn't look much different. I still find it worth pointing out in case it's not clear to everyone, and in case someone for example tries to measure only the O(n) ones with a (realistic, not quadratic) worst case for them. Yes, a graph with large n would be interesting, with at least up to the "over 2 million elements" stated by the OP. If numpy is O(n), it might end up fastest. Or if it takes O(n log n), it might curve upwards away from the O(n) pack. – Kelly Bundy May 06 '22 at 13:45
  • @KellyBundy: Thanks for the interesting comments. [`np.isin`](https://github.com/numpy/numpy/blob/v1.22.0/numpy/lib/arraysetops.py#L640-L736) is probably O(n.log n) indeed, since it calls [`in1d`](https://github.com/numpy/numpy/blob/4adc87dff15a247e417d50f10cc4def8e1c17a03/numpy/lib/arraysetops.py#L520), which removes duplicates from `array1` and `array2`, sorts `array1 + array2`, and checks the sorted array for duplicate values. (At least that's how I understand the code). It's clever, and can be optimized with a `assume_unique` flag. – Eric Duminil May 06 '22 at 14:56
  • Thanks for your contributions Kelly and Eric. I am a little preoccupied with work right now (and my CPUs are too) but I am just writing to let you know I am listening and that I will try to incorporate your suggestions. Your use of `repeat` @KellyBundy is quite innovative. I will need to look into the C implementation for that one to understand it a little more. – OTheDev May 06 '22 at 14:56
  • I think that in order for meaningful comparisons to be made, it is important that the input and the output are well-specified (types, sizes, etc.). OP mentions that the inputs are lists, so I have maintained that assumption. The fact that the inputs are assumed to be lists means that if we want to use the suggested `isin` numpy approach, the timings need to include the conversion from lists to arrays. I can look into adding one graph that bases its timings on the input being numpy arrays, for those who stumble upon this post and find that their inputs are already numpy arrays. – OTheDev May 06 '22 at 15:03
  • @EricDuminil Though the input size does not reach 2 million (I can try), (1) the lists in the third graph are disjoint (2) the lists in Setup 2 (not the graph) are also disjoint. Notice how for setup 2 the difference between the timings between set approaches versus the nonset approaches are quite significant (for example, about 18 **ms** for the numpy approach versus 48.6 **s** for the list comprehension) . So I think they help make your point? – OTheDev May 06 '22 at 15:13
  • The lists of the second graph are practically also disjoint, the birthday paradox says hello. First graph's aren't, but still take a lot of searching. [Percentages of actual vs potential search costs](https://tio.run/##lZDdagMhEIXvfYq5KdFF0rq9aAj0SUIJkriJsKuLztKEkGff@pNupD@06406M@c7h@nPeLTmedW7cdRdbx2Ck2ZvOyBkrxrwR/u@3VmPlK0JhBPfHl7hck3fxjrQHE6gDSgzdMpJVLTVHrc1B3ETTcKlVxiwcmiRnjholtpyh4Nsk00g@6Gjefig0lSrzI3IWDJMbqkiMqC3qAzqO2OSCAZVCcjjThukzeJSGj9@pawfrgtGiAm4GqoKxAshmRkqm7ylZbwiTHAwOZyO4UL5oKhhb1lRz1EUK//DMKaq57v@LvvZOj7o5xC7s8s6jzuC2P5//G8RxFNizF3aJCusx/ED) for the four graphs. – Kelly Bundy May 06 '22 at 15:20
  • @KellyBundy: When you mention the birthday paradox, you refer to the left part of this diagram, correct? https://en.wikipedia.org/wiki/Birthday_problem#/media/File:Birthdaymatch.svg When I think about the birthday paradox, I typically concentrate on the middle and right part. – Eric Duminil May 07 '22 at 08:06
  • @EricDuminil I guess so, yes. What I meant is the square root rule of thumb. In order to have a 50% chance of having *any* duplicate, you need about sqrt(possiblevalues) elements. Which we do there, as we have n elements from a range of size n^2 (or rather 2n, in the two lists together). So quite likely there aren't any duplicates or only one duplicate in those list pairs. – Kelly Bundy May 07 '22 at 10:14
  • @EricDuminil (continuing) [Experiment](https://tio.run/##nY7BDsIgEETvfMWeDIumkXrQmPTkZzSNMRWVBHYJxYNfX4vUqFe5wM7wZic80o1pswtxHK0PHBPEE53Zi0tkDz07Z/pkmQaY7QPfKZkoRM/eZ715SxIFTVOtlN6KC0c4gqUcdzVSr9e4FzAdZ4d01NO/tiyq8mUpSb0CAqWgRsiw/cCE3Qet/0JL2QnNGVK@G5cyCAv4UWrEyjjjDaVBIn4lDK0zJMsbO1g2oF9uiLmHmg1RxhnBcXwC) with `n=2**17`: In 100 attempts, 73 times there was at most one common value in the two lists. Full distribution: `{0: 36, 1: 37, 2: 22, 3: 3, 4: 2}`. – Kelly Bundy May 07 '22 at 10:26
40

You can take advantage of the O(1) complexity of the `in` operator for sets to make your for loop more efficient, so your final algorithm would run in O(n) time, instead of O(n*n):

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)
print(booleans)

It is even faster as a list comprehension:

s = set(list_2)
booleans = [i in s for i in list_1]

If you only want to know which elements the lists have in common, you can use a set intersection, which is efficient because the `set()` type is already optimized by other Python engineers:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

print(set(list_1).intersection(set(list_2)))

Output:

{1, 2}

Also, to provide the output in list format, you can turn the resulting set into a list with the `list()` function:

print(list(set(list_1).intersection(set(list_2))))
Cardstdani
  • Sorry I wasn't clear, I need the whole list and ordered – herdek550 Apr 24 '22 at 16:51
  • @herdek550 Do you want the list to contain `True` and `False`, or the elements which are not the same? – codester_09 Apr 24 '22 at 16:52
  • @Cardstdani I was going to include a solution using `set` too. I want to point out that for larger lists (just tested this) the gains from using your way are huge compared to the highest-voted solution and a basic list comprehension. – OTheDev Apr 24 '22 at 17:06
  • @oda I don't know how much it takes to build the set, which can be the only disadvantage of this algorithm, but for larger inputs I think O(n) is enough of a gain compared to O(n*n) – Cardstdani Apr 24 '22 at 17:08
  • I've run some tests: The intersection as base in combination with a list comprehension seems faster than the `numpy.isin` solution. – Timus Apr 24 '22 at 17:26
  • @Timus you mean checking the elements of `list_1` directly on the intersection of sets? – Cardstdani Apr 24 '22 at 17:30
  • @Timus I found @Cardstdani's solution to be faster than `numpy` for two 100k element lists of random numbers in [1, 10000]. – OTheDev Apr 24 '22 at 17:31
  • @Cardstdani I have included a few timings below. – OTheDev Apr 24 '22 at 17:31
  • @oda I've tried `common = set(list_1) & set(list_2); result = [item in common for item in list_1]` and it takes about half the time of `numpy.isin` here. – Timus Apr 24 '22 at 17:32
  • @Timus yes, this is similar to Cardstdani's method, or maybe even faster (`common` is smaller than `set(list_2)`) – crissal Apr 24 '22 at 17:44
  • @crissal It's faster, imho – Timus Apr 24 '22 at 17:45
  • Yep, I think so, but I think they're in the same time range – crissal Apr 24 '22 at 17:47
  • In the end I will use your original answer as well. I ran into another problem, and your idea of using `set().intersection()` is perfect for it, thanks. – herdek550 Apr 24 '22 at 18:43
  • "You can take advantage of the O(1) in operator complexity for the set() function..." This phrasing confused me a lot, because it sounded to me like you was saying that the `set` constructor function runs in O(1), which as far as I know is not true (AFAIK it's O(n)). But after a couple reads, I think you mean to say that the `in` operator for the `set` type is O(1), which is definitely true and relevant. – Chris Bouchard Apr 25 '22 at 04:08
  • Also, a note: If this is an operation you need to perform frequently on long-lived lists, it might be worth caching the set and keeping it updated as the list changes. That way you would avoid the O(n) hit of converting the list to a set each time. It wouldn't change the O complexity, but it would speed up runtime. You could even write/find a datatype that provides indexing and O(1) search (a list+set for lack of a better name); see the sketch after this comment thread. – Chris Bouchard Apr 25 '22 at 04:15
  • It seems to me the main performance difference between `booleans = [] ... append(...)` and list comprehension is the initial size of the array. `booleans = [False] * len(list_1) ... booleans[n] = ...` should be just as fast – Falco Apr 26 '22 at 07:50
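
A minimal sketch of the cached list+set idea from Chris Bouchard's comment above (a hypothetical wrapper, append-only for brevity; removing elements would need extra bookkeeping to handle duplicates):

class IndexedSet:
    # Keep a list (for ordering and indexing) and a set (for O(1)
    # membership tests) in sync, so repeated lookups avoid rebuilding
    # the set from the list each time.
    def __init__(self, items=()):
        self._items = list(items)
        self._members = set(self._items)

    def append(self, item):
        self._items.append(item)
        self._members.add(item)

    def __getitem__(self, index):
        return self._items[index]

    def __contains__(self, item):
        return item in self._members

list_2 = IndexedSet([1, 2, 3, 4, 5, 6])
booleans = [i in list_2 for i in [0, 0, 1, 2, 0, 0]]
print(booleans)  # [False, False, True, True, False, False]
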
15

If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.

import numpy as np

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)
# array([False, False,  True,  True, False, False])
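
As Eric Duminil's comment below explains, `np.isin` is sort-based rather than hash-based. A rough pure-NumPy sketch of that idea (a simplification for illustration, not NumPy's actual implementation):

def isin_sketch(a1, a2):
    # Deduplicate the query array, remembering where each original
    # element landed so the answer can be mapped back at the end.
    ar1, rev_idx = np.unique(a1, return_inverse=True)
    ar2 = np.unique(a2)
    # After deduplication, a value present in both arrays appears
    # exactly twice in the sorted concatenation, as equal neighbours.
    ar = np.concatenate((ar1, ar2))
    order = ar.argsort(kind='mergesort')  # stable: ar1's copy sorts first
    sar = ar[order]
    dup = np.concatenate((sar[1:] == sar[:-1], [False]))
    # Map "followed by an equal neighbour" back to pre-sort positions;
    # the first len(ar1) entries then answer membership for ar1's values.
    result = np.empty(len(ar), dtype=bool)
    result[order] = dup
    return result[:len(ar1)][rev_idx]

print(isin_sketch(a1, a2))
# [False False  True  True False False]
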
crissal
  • *Mea culpa*. I just checked the source code of `np.isin`, and it seems to be a better algorithm than I had assumed. [`np.isin`](https://github.com/numpy/numpy/blob/v1.22.0/numpy/lib/arraysetops.py#L640-L736) is probably O(n.log n) indeed, since it calls [`in1d`](https://github.com/numpy/numpy/blob/4adc87dff15a247e417d50f10cc4def8e1c17a03/numpy/lib/arraysetops.py#L520), which removes duplicates from `array1` and `array2`, sorts `array1 + array2`, and checks the sorted array for duplicate values. (At least that's how I understand the code). – Eric Duminil May 06 '22 at 14:59
5

You can use the `map` function.

Inside `map` I use a lambda function. If you are not familiar with lambda functions, you can read up on them first.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = list(map(lambda e: e in list_2, list_1))
print(booleans)

Output:

[False, False, True, True, False, False]

However, if you want only the elements that are present in both lists, then instead of `map` you can use the `filter` function with the same predicate.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

new_lst = list(filter(lambda e: e in list_2, list_1))  # edited: filter instead of map
print(new_lst)

Output:

[1, 2]

Edited

I am removing the list membership test (`e in list_2`) from the code, because `in` on a list also acts as a loop. I checked using the `timeit` module.
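
A minimal sketch of that kind of timeit check (illustrative only; these are not the exact commands or inputs used):

import timeit

setup = """
list_1 = [0, 0, 1, 2, 0, 0] * 1000
list_2 = [1, 2, 3, 4, 5, 6] * 1000
set_2 = set(list_2)
"""

# Membership via `in` on a list rescans the list for every element.
t_list = timeit.timeit("list(map(lambda e: e in list_2, list_1))",
                       setup=setup, number=10)
# The set-difference test below avoids that repeated list scan.
t_set = timeit.timeit("list(map(lambda e: set_2 != set_2 - {e}, list_1))",
                      setup=setup, number=10)
print(f"list membership: {t_list:.3f} s, set difference: {t_set:.3f} s")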

You can use this code for the list containing `True` and `False`.

This way is faster than the one above.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))
print(booleans)

Output:

[False, False, True, True, False, False]

This one is for the list containing the elements.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(filter(lambda e: set_2 != set_2 - {e}, list_1))  # edited: filter instead of map
print(booleans)

Output:

[1,2]

Because the OP doesn't want to use a lambda function, here is a version with a named function:

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)

def func(e):
    return set_2 != set_2 - {e}

booleans = list(map(func, list_1))

I know my way isn't the best way to answer this, because I have never used NumPy much.

codester_09
  • @SharimIqbal Ahh, maybe I screwed up my implementation. Anyway, thanks for your answer and deep explanation. But in the end I will stick with numpy, because I have never used lambda functions, and numpy syntax and functions are similar to R, which I am familiar with. – herdek550 Apr 24 '22 at 18:39
  • @herdek550 Instead of a lambda you can use a simple function; I am editing my answer. – codester_09 Apr 24 '22 at 18:41
  • In your test of whether numpy was really faster, you start the timing before you import numpy. This means that you also include the importing of numpy in your timings which is misleading. – Gerrit Luimstra Apr 25 '22 at 12:23
  • I would argue differently. If we are talking about time complexities, then a constant addition to the timing (importing a library) should not be included. You can of course make a note that the numpy version takes a little longer to start up (due to the import) but in the case of large N this is not relevant. – Gerrit Luimstra Apr 25 '22 at 12:33
  • Doesn't `set_2-{e}` have to construct a new copy of the set with one element removed? Is that why oda's testing found versions using that run as slowly as O(n^2) implementations? Surely there are faster ways to check for non-membership in a set? (Python isn't a language I know well, so maybe I'm missing something or misreading the syntax. Or maybe this is less bad on an implementation other than CPython, if they can JIT optimize it better instead of naively interpreting?) – Peter Cordes Apr 26 '22 at 11:15
  • @PeterCordes Yes, it does have to make a copy of `set_2` and remove `e` from that copy. So it consumes time, and RAM. – PM 2Ring Apr 27 '22 at 17:59
  • map & filter are ok if the function arg is an existing function (ideally, one that runs at C speed, like a built-in). It's not so good to use them with lambdas: it's better to use a list comp or generator and avoid the extra function call on every loop iteration (Python function calls have more overhead than C calls). – PM 2Ring Apr 27 '22 at 18:05
3

It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting a list is O(n log n), but once you have them sorted, you can compare them in linear time by checking whether the current elements match, and if they don't, advancing to the next item in the list whose current element is smaller.
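
A minimal sketch of that linear-time merge-style comparison, assuming both lists are already sorted (producing the booleans in the question's original element order would need an extra step to track pre-sort positions):

def membership_sorted(list_1, list_2):
    # Both lists must be sorted ascending; walk them in lockstep.
    booleans = []
    j = 0
    for x in list_1:
        # Advance list_2 past everything smaller than x; j never moves
        # backwards, so the whole pass is O(len(list_1) + len(list_2)).
        while j < len(list_2) and list_2[j] < x:
            j += 1
        booleans.append(j < len(list_2) and list_2[j] == x)
    return booleans

print(membership_sorted([0, 0, 0, 0, 1, 2], [1, 2, 3, 4, 5, 6]))
# [False, False, False, False, True, True]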

Acccumulation
1

Use `set()` to get the unique items from each list:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

set_1 = set(list_1)
set_2 = set(list_2)

if(set_1 & set_2):
  print(set_1 & set_2)
else:
  print("No common elements")

Output:

{1, 2}
SPYBUG96
  • Does `if(set_1 & set_2): print(set_1 & set_2)` evaluate `set_1 & set_2` twice, or is it smart enough to cache the result from the first time? – Acccumulation Apr 25 '22 at 03:34
  • @Acccumulation you would need to set it to a variable beforehand, then evaluate, so: `foo = set_1 & set_2` then `if(foo):` and `print(foo)` – SPYBUG96 Apr 25 '22 at 15:10
  • you can write it in one line: `print((set_1 & set_2) or "No common elements")`. Considering readability, I would not recommend this though – Aemyl Apr 25 '22 at 20:56
  • Interesting answer to the title question, although not the list of bools this specific question was looking for. I expect building a set out of the 2nd list is similar cost to checking each element for membership in another set, and then the actual intersection is fast if the sets are small (e.g. if a large array had many duplicates). So more total work but perhaps less memory touched (than bool list) if the sets are small. Does it give any guarantees about preserving order, in case anyone needs that? Like elements of the intersection appearing in the order they did in list_1 or list_2? – Peter Cordes Apr 27 '22 at 07:13
  • @PeterCordes Oops, I must have hyperfocused on the original title question. This should be easy to modify to answer the original question, though. As for your question, I'm not 100% certain; looking at this answer https://stackoverflow.com/questions/23529001/ordered-intersection-of-two-lists-in-python it looks like order is not preserved from either of the sets, but if needed that answer has a pretty good approach – SPYBUG96 Apr 27 '22 at 13:44
  • I'd recommend just leaving it in this state (maybe with a note acknowledging that it's answering a variation on the problem which people who get here from the question title might well have); existing answers already use `set(list)` for one of the inputs and check the other against it. Including the fastest answer in the benchmark shootout. – Peter Cordes Apr 27 '22 at 13:49
1

If you know the values are non-negative and the maximum value is much smaller than the length of the lists, then using numpy's `bincount` might be a good alternative to using a set:

np.bincount(list_2).astype(bool)[list_1]

If list_1 and list_2 happen to be numpy arrays, this can even be a lot faster than the set + list-comprehension solution. (In my test 263 µs vs 7.37 ms; but if they're python lists, it's slightly slower than the set solution, with 8.07 ms)
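
A guarded sketch of this approach for numpy-array inputs, using `minlength` as towr's comment below suggests, so the lookup table covers every value used as an index:

import numpy as np

list_1 = np.array([0, 0, 1, 2, 0, 0])
list_2 = np.array([1, 2, 3, 4, 5, 6])

# The table needs an entry for every value in list_1, not just for
# the values that actually occur in list_2.
n = max(list_1.max(), list_2.max()) + 1
in_list_2 = np.bincount(list_2, minlength=n).astype(bool)

print(in_list_2[list_1])
# [False False  True  True False False]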

towr
  • NB `np.bincount` returns an array just long enough to cover the maximum value of its input, but it has a parameter `minlength` to force a larger size. So if `list_1` contains values bigger than anything in `list_2`, the indexing will break. To be general you'd need to set `minlength=max(list_1.max(), list_2.max()) + 1` if they're numpy arrays (and you want to maintain speed) or `minlength=max(max(list_1), max(list_2)) + 1` otherwise. – towr Apr 26 '22 at 14:59
0

SPYBUG96's method will work best and fastest. If you want to get an indexable object with the common elements of the two sets, you can use the `tuple()` function on the final set:

a = set(range(1, 6))
b = set(range(3, 9))
c = a & b
print(tuple(c))