Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.

A straightforward way to do this is:

true_count = 0
false_count = 0
for i, x in enumerate(list1):
    print(i)
    if x in list2:
        true_count += 1
    else:
        false_count += 1

print(true_count)
print(false_count)

This does the trick, but with 10 million rows it can take quite some time. Is there a built-in function I don't know about that does this much faster, or an entirely different approach?

marc_s
Denver Dang

3 Answers

Using Pandas

Here's how you can do it using pandas DataFrames.

import pandas as pd
import random
list1 = [random.randint(1,10) for i in range(10)]
list2 = [random.randint(1,10) for i in range(10)]

df1 = pd.DataFrame({'list1':list1})
df2 = pd.DataFrame({'list2':list2})

print(df1)
print(df2)

# True only if every value in list2 also appears in list1
print(df2.list2.isin(df1.list1).all())

For illustration, I use just 10 rows, each a random integer between 1 and 10:

List 1:

   list1
0      3
1      5
2      4
3      1
4      5
5      2
6      1
7      4
8      2
9      5

List 2:

   list2
0      2
1      3
2      2
3      4
4      3
5      5
6      5
7      1
8      4
9      1

The final print statement outputs the following, because every value in list2 also appears in list1:

True

For timing, I checked against these larger random lists:

list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]

With 10 million random numbers in list1 and 5 million in list2, the result came back in 2.207757880999999 seconds on my Mac.
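The isin check above reports whether every value matches. To get the counts the question asks for, the same boolean mask can be summed instead. This is a minimal sketch, assuming small sample lists chosen here only for illustration; the names s1, mask, true_count, and false_count are not from the original answer.

```python
import pandas as pd

# Illustrative sample data (assumed, not from the original answer)
list1 = [3, 5, 4, 1, 5, 2, 7]
list2 = [2, 3, 2, 4, 9]

s1 = pd.Series(list1)

# Boolean mask: True where the list1 item also occurs in list2
mask = s1.isin(list2)

# True counts as 1 when summed, so the sum is the number of matches
true_count = int(mask.sum())
false_count = len(s1) - true_count

print(true_count)
print(false_count)
```

Duplicates in list1 are each counted separately here, which matches the behavior of the loop in the question.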

Using Set

Alternatively, you can convert the lists into sets and check whether one set is a subset of the other.

set1 = set(list1)
set2 = set(list2)
print(set2.issubset(set1))

Comparing the two runs, the set approach is also fast: it came back in 1.6564296570000003 seconds.
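The subset check gives a yes/no answer. For the per-item counts from the question, the same idea can be applied by building a set from list2 once and testing each item of list1 against it, since set membership tests take constant time on average while list membership tests scan the whole list. A minimal sketch, assuming all items are hashable; count_membership is a hypothetical helper name.

```python
def count_membership(list1, list2):
    """Return (true_count, false_count) for items of list1 found in list2."""
    # Build the lookup set once; each `in` test is then O(1) on average
    lookup = set(list2)
    true_count = sum(1 for x in list1 if x in lookup)
    return true_count, len(list1) - true_count


# Illustrative sample data (assumed)
true_count, false_count = count_membership(["a", "b", "a", "c"], ["a", "c", "d"])
print(true_count)
print(false_count)
```

Only list2 is converted to a set, so duplicates in list1 are still counted individually.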

Joe Ferndz

You can convert the lists to sets and compute the length of the intersection between them. Note that this counts distinct shared values, so duplicates in either list are counted only once.

len(set(list1) & set(list2))
Juan Pablo

You can use NumPy: convert both lists to arrays with np.array().

After that, both lists are np.ndarray objects, and because they have only one dimension you can use np.intersect1d() and count the common items with .size

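import numpy as np

lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]

# Convert both lists to one-dimensional arrays
a_list = np.array(lst)
b_list = np.array(lst2)

# Sorted, unique values that appear in both arrays
c = np.intersect1d(a_list, b_list)

print(c.size)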
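np.intersect1d returns unique values, so the size above is the number of distinct common items. If every occurrence in the first list should be counted, as in the loop from the question, np.isin is one option. A minimal sketch reusing the same sample lists; mask, per_item_count, and distinct_count are names introduced here for illustration.

```python
import numpy as np

lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]

# Boolean array: True where the element of lst also occurs in lst2
mask = np.isin(np.array(lst), np.array(lst2))

# Counts every matching occurrence in lst, including the repeated 6
per_item_count = int(mask.sum())

# For comparison: number of distinct shared values
distinct_count = np.intersect1d(lst, lst2).size

print(per_item_count)
print(distinct_count)
```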
felipon