Calculate how many items in one list are in another

Question

Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.

This can be done as follows:

true_count = 0
false_count = 0
for i, x in enumerate(list1):
    print(i)
    if x in list2:
        true_count += 1
    else:
        false_count += 1

print(true_count)
print(false_count)

This does the trick; however, with 10 million rows it could take quite some time. Is there a function I don't know about that can do this much faster, or an entirely different approach?

I recommend using pandas instead of iterating through the list. – Joe Ferndz, Jan 29 '21 at 01:48
You don't need `false_count`. Since the loop runs over `list1`, it is the same as `len(list1) - true_count`. Maybe that can save some computing cycles. – accdias, Jan 29 '21 at 01:51
There is https://stackoverflow.com/questions/46862408/python-find-count-of-the-elements-of- but... — user202729, Jan 29 '21 at 01:54

Joe Ferndz · Accepted Answer · 2021-01-29T03:11:53.060

Using Pandas

Here's how you can do it using a pandas DataFrame.

import pandas as pd
import random

# Two small lists of random integers between 1 and 10
list1 = [random.randint(1, 10) for i in range(10)]
list2 = [random.randint(1, 10) for i in range(10)]

df1 = pd.DataFrame({'list1': list1})
df2 = pd.DataFrame({'list2': list2})

print(df1)
print(df2)

# True only if every item of list2 also appears in list1
print(all(df2.list2.isin(df1.list1).astype(int)))

For this example I generate just 10 rows of random numbers for each list:

List 1:

List 2:

The output of the final print statement, which reports whether every item in list2 also appears in list1, will be:

True

The random lists I checked against are:

list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]

I ran a test with 10 million random numbers in list1 and 5 million random numbers in list2. On my Mac, the result came back in 2.207757880999999 seconds.
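
The `all(...)` check in this answer yields a single True/False for the whole of list2. To get the counts the question asks about, one possible sketch is to sum the boolean mask returned by `Series.isin`. The small literal lists below are illustrative assumptions, not the data used in the timing test.

```python
import pandas as pd

# Small illustrative lists (hypothetical data, not from the original test).
list1 = [1, 2, 2, 3, 9]
list2 = [2, 3, 4]

s1 = pd.Series(list1)

# isin() returns a boolean mask: True where an item of list1 occurs in list2.
mask = s1.isin(list2)

true_count = int(mask.sum())            # items of list1 found in list2
false_count = len(list1) - true_count   # items of list1 not found in list2

print(true_count, false_count)
```

Because the mask has one entry per row of list1, duplicates in list1 are each counted.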

Using Set

Alternatively, you can convert the lists into sets and check whether one set is a subset of the other.

set1 = set(list1)
set2 = set(list2)
print (set2.issubset(set1))

Comparing the results of the two runs, the set approach is also fast. It came back in 1.6564296570000003 seconds.

score 1 · Answer 2 · answered Jan 29 '21 at 02:11

You can convert the lists to sets and compute the length of the intersection between them.

len(set(list1) & set(list2))

Juan Pablo

This is wrong if there are duplicate elements in `list1`. – user202729 Jan 30 '21 at 11:55
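
To address the duplicate issue raised in the comment, one option is to convert only list2 to a set and keep iterating over list1, so that every occurrence in list1 is counted. This is a minimal sketch with illustrative data; the name `lookup` is an assumption, and the items must be hashable.

```python
# Illustrative data (hypothetical); "b" appears twice in list1.
list1 = ["a", "b", "b", "c", "z"]
list2 = ["b", "c", "d"]

# Build the set once; membership tests on a set are O(1) on average,
# whereas `x in list2` scans the whole list each time.
lookup = set(list2)

true_count = sum(1 for x in list1 if x in lookup)   # counts every occurrence
false_count = len(list1) - true_count

# For comparison: the intersection counts each distinct value only once.
unique_common = len(set(list1) & set(list2))

print(true_count, false_count, unique_common)
```

Here the loop-based count differs from the intersection length precisely because of the repeated "b".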

score 1 · Answer 3 · answered Jan 29 '21 at 03:08

You can use NumPy and convert the lists into arrays with np.array().

After that, both lists are NumPy array objects, and because they have only one dimension you can use np.intersect1d() and count the unique common items with .size

import numpy as np

lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]

# Convert both lists to one-dimensional arrays
a_list = np.array(lst)
b_list = np.array(lst2)

# Sorted, unique values present in both arrays
c = np.intersect1d(a_list, b_list)

print(c.size)
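
Note that `np.intersect1d` returns the sorted unique common values, so `c.size` counts distinct values rather than occurrences. If every occurrence in the first list should be counted, as in the loop from the question, a sketch using `np.isin` might look like the following. It reuses the lists above; the names `mask`, `true_count` and `unique_common` are assumptions for illustration.

```python
import numpy as np

lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]

a = np.array(lst)
b = np.array(lst2)

# Boolean mask with one entry per element of a: True if that element occurs in b.
mask = np.isin(a, b)

true_count = int(np.count_nonzero(mask))   # counts the repeated 6 twice
unique_common = int(np.intersect1d(a, b).size)  # counts each distinct value once

print(true_count, unique_common)
```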
