Python : Most efficient way to count elements of long list 1 in long list 2 ? (list comprehension is really slow)

Question

I have many tuples (e.g. (0, 1)) in two pretty long lists list_0and list_1of size ~40k elements. I need to count the tuples of list_0 also in list_1.

The following statement with list comprehension takes ~ 1 min and I need to do this multiple times, so I am looking for something more efficient :

len([element for element in list_0 if element in list_1])

What could be more efficient ?

To reproduce :

elements_0 = 200*[0]+50*[1]+150*[2]
elements_1 = 100*[1]+150*[2]+150*[1]
df = pd.DataFrame(data=[list(elements_0), list(elements_1)]).T
list_0 = [item for sublist in df.groupby(0)[0].apply(lambda x: list(combinations(x.index, 2))) for item in sublist]
list_1 = [item for sublist in df.groupby(1)[1].apply(lambda x: list(combinations(x.index, 2))) for item in sublist]

print(len([pair for pair in list_0 if pair in list_1])) # Long

Check your variable names, this code isn't reproducible. Looks like you might be mixing up `elements_0`, `list0`, and `list_0` — G. Anderson, Dec 08 '20 at 16:23
[The `numpy` answer to this question](https://stackoverflow.com/questions/64890422/np-where-checking-also-for-subelements-in-multidimensional-arrays/64891214#64891214) — Daniel F, Dec 09 '20 at 08:35

Quang Hoang · Accepted Answer · 2020-12-08T16:43:04.920

3

It looks like you can use

pd.Series(list_0).isin(list_1).sum()

Output:

22300
CPU times: user 14.8 ms, sys: 20 µs, total: 14.8 ms
Wall time: 14.1 ms

which should give the same answer with:

len([element for element in list_0 if element in list_1])

which gives:

22300
CPU times: user 13.8 s, sys: 0 ns, total: 13.8 s
Wall time: 13.8 s

Also merge and query:

s = df.reset_index()
print(len(s.merge(s, on=[0,1])
  .query('index_x > index_y')
))

with output:

22300
CPU times: user 13.4 ms, sys: 15 µs, total: 13.4 ms
Wall time: 12.3 ms

edited Dec 08 '20 at 16:43

answered Dec 08 '20 at 16:17

Quang Hoang

146,074
10
56
74

A typo in variable names - corrected. Thanks – Vincent Dec 08 '20 at 16:19
@Vincent see updated answer with performance comparison. – Quang Hoang Dec 08 '20 at 16:39
@Vincent updated with a slightly faster solution. – Quang Hoang Dec 08 '20 at 16:44
@Daniel F pointed this solution in numpy : https://stackoverflow.com/questions/64890422/np-where-checking-also-for-subelements-in-multidimensional-arrays/64891214#64891214 By any chance, could you please compare it with the same setup as the previous 2 ? – Vincent Dec 09 '20 at 08:55

score 1 · Answer 2 · answered Dec 08 '20 at 16:17

1

Make list_1 a set first then find the elements that are in it:

list_1_set = set(list_1)
print(len(element for element in list_0 if element in list_1_set))

answered Dec 08 '20 at 16:17

Aplet123

33,825
1
29
55

score 1 · Answer 3 · answered Dec 08 '20 at 16:20

Maybe you can try to sort these two lists first. Then you can use two pointers for these two lists. Compare whether the tuples on the two pointers match or not, then update the pointer to next tuple in the list. This should take O(n). Please correct me if there is anything wrong. Thanks.

Python : Most efficient way to count elements of long list 1 in long list 2 ? (list comprehension is really slow)

3 Answers3