
I have two large lists containing text: X = [30,000 entries] and Y = [400 entries].

I want to find the texts that are similar across both lists using cosine similarity. Below is the code I am trying to execute, using nested for loops:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer()
found_words = []
for x in X:
    for y in Y:
        # vectorize the current pair of texts into count vectors
        vectors = vectorizer.fit_transform([x.lower(), y.lower()])
        sim = cosine_similarity(vectors[0], vectors[1])[0, 0]
        if sim > 0.9:
            found_words.append(x.capitalize())

The above code works fine but takes a lot of time to execute. Is there another way that is more efficient in both time and space complexity? Thank you.

1 Answer


Instead of calling cosine similarity pair by pair, you can compute the dot products of normalised vectors: the cosine of two vectors is exactly the dot product of their unit-normalised forms. The vectorisation can then be done once, before this operation.
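
For a single pair the two computations agree; here is a quick sanity check with small made-up vectors (the values are only for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[0.0, 1.0, 4.0]])

# cosine similarity as computed pair by pair in the original code
sk = cosine_similarity(x, y)[0, 0]

# the same value from the dot product of the unit-normalised vectors
dp = ((x / np.linalg.norm(x)) @ (y / np.linalg.norm(y)).T)[0, 0]

print(np.isclose(sk, dp))  # True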

Here is my attempt to replicate the test with random vectors:

import numpy as np 

# assume vector dimension is 100:
a = np.random.random([30000, 100]) # X vectors
b = np.random.random([400, 100]) # Y vectors

# normalise each vector to unit length
a = a / np.linalg.norm(a, axis=1, keepdims=True)  # shape (30000, 100)
b = b / np.linalg.norm(b, axis=1, keepdims=True)  # shape (400, 100)

sims = np.tensordot(a, b, axes=([1], [1]))  # pairwise dot products, shape (30000, 400)

print(np.where(sims > 0.87)[0])  # indices of matched items in X

I reduced the threshold to 0.87 to be able to show some results in my random vectors.

Replace the random a and b with the vectorisation code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])  # shared vocabulary for both lists
a = vectorizer.transform([s.lower() for s in X]).toarray()
b = vectorizer.transform([s.lower() for s in Y]).toarray()

Also, at the end, you need to use the X indices to map back to the original texts:

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
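
A note on space: CountVectorizer already returns sparse matrices, and the same normalised dot product works on them directly, without converting to dense arrays; a minimal sketch, assuming sklearn.preprocessing.normalize for the row normalisation:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])
a = normalize(vectorizer.transform([s.lower() for s in X]))  # sparse rows, unit length
b = normalize(vectorizer.transform([s.lower() for s in Y]))

sims = (a @ b.T).toarray()  # cosine similarities, shape (30000, 400)
x_indices = set(np.where(sims > 0.9)[0].tolist())
found_words = [X[i] for i in x_indices]

This keeps the 30,000 x vocabulary matrix sparse; only the final (30000, 400) similarity matrix is dense.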

If you have access to an Nvidia GPU with CUDA support, you can use it for faster, parallelised tensor operations. You can use torch to access the device:

import torch
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])  # shared vocabulary for both lists
a = vectorizer.transform([s.lower() for s in X]).toarray()
b = vectorizer.transform([s.lower() for s in Y]).toarray()

# normalise the vectors and also convert them to tensors on the GPU
a = torch.tensor(a / np.linalg.norm(a, axis=1, keepdims=True), device='cuda')  # shape (30000, d)
b = torch.tensor(b / np.linalg.norm(b, axis=1, keepdims=True), device='cuda')  # shape (400, d)

sims = torch.tensordot(a, b, dims=([1], [1])).cpu().numpy()
# shape (30000, 400)

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices))  # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
Mehdi