2

I have many tuples (e.g. (0, 1)) in two pretty long lists list_0and list_1of size ~40k elements. I need to count the tuples of list_0 also in list_1.

The following statement with list comprehension takes ~ 1 min and I need to do this multiple times, so I am looking for something more efficient :

len([element for element in list_0 if element in list_1])

What could be more efficient ?

To reproduce :

elements_0 = 200*[0]+50*[1]+150*[2]
elements_1 = 100*[1]+150*[2]+150*[1]
df = pd.DataFrame(data=[list(elements_0), list(elements_1)]).T
list_0 = [item for sublist in df.groupby(0)[0].apply(lambda x: list(combinations(x.index, 2))) for item in sublist]
list_1 = [item for sublist in df.groupby(1)[1].apply(lambda x: list(combinations(x.index, 2))) for item in sublist]

print(len([pair for pair in list_0 if pair in list_1])) # Long
Vincent
  • 1,534
  • 3
  • 20
  • 42
  • 1
    Check your variable names, this code isn't reproducible. Looks like you might be mixing up `elements_0`, `list0`, and `list_0` – G. Anderson Dec 08 '20 at 16:23
  • Yes I corrected it, thanks – Vincent Dec 08 '20 at 16:24
  • [The `numpy` answer to this question](https://stackoverflow.com/questions/64890422/np-where-checking-also-for-subelements-in-multidimensional-arrays/64891214#64891214) – Daniel F Dec 09 '20 at 08:35

3 Answers3

3

It looks like you can use

pd.Series(list_0).isin(list_1).sum()

Output:

22300
CPU times: user 14.8 ms, sys: 20 µs, total: 14.8 ms
Wall time: 14.1 ms

which should give the same answer with:

len([element for element in list_0 if element in list_1])

which gives:

22300
CPU times: user 13.8 s, sys: 0 ns, total: 13.8 s
Wall time: 13.8 s

Also merge and query:

s = df.reset_index()
print(len(s.merge(s, on=[0,1])
  .query('index_x > index_y')
))

with output:

22300
CPU times: user 13.4 ms, sys: 15 µs, total: 13.4 ms
Wall time: 12.3 ms
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

Make list_1 a set first then find the elements that are in it:

list_1_set = set(list_1)
print(len(element for element in list_0 if element in list_1_set))
Aplet123
  • 33,825
  • 1
  • 29
  • 55
1

Maybe you can try to sort these two lists first. Then you can use two pointers for these two lists. Compare whether the tuples on the two pointers match or not, then update the pointer to next tuple in the list. This should take O(n). Please correct me if there is anything wrong. Thanks.

Jiao Dian
  • 31
  • 5