Instead of calling a cosine-similarity function pair by pair, you can normalise the vectors and take dot products: the cosine similarity of two vectors is exactly the dot product of their unit-normalised versions. The text vectorisation can then be done once, up front, before this operation.
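As a quick sanity check of that equivalence (purely illustrative, not part of the pipeline):
import numpy as np
# cosine similarity equals the dot product of the unit-normalised vectors
u, v = np.random.random(100), np.random.random(100)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
dot_of_units = (u / np.linalg.norm(u)) @ (v / np.linalg.norm(v))
assert np.isclose(cos, dot_of_units)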
Here is my attempt to replicate the test with random vectors:
import numpy as np
# assume vector dimension is 100:
a = np.random.random([30000, 100]) # X vectors
b = np.random.random([400, 100]) # Y vectors
a = np.array([[_v/np.linalg.norm(_v)] for _v in a])  # unit-normalise each row; shape (30000, 1, d)
b = np.array([[_v/np.linalg.norm(_v)] for _v in b])  # unit-normalise each row; shape (400, 1, d)
sims = np.tensordot(a, b, axes=([1,2], [1,2]))  # all pairwise dot products; shape (30000, 400)
print(np.where(sims > 0.87)[0])  # indices of matched items in X
I lowered the threshold to 0.87 so that the random vectors produce at least some matches.
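For reference, the same similarity matrix can be computed without the extra axis, by normalising the rows of the plain 2-D arrays and taking a matrix product; this is just an equivalent formulation of the tensordot above:
a2 = np.random.random([30000, 100])
b2 = np.random.random([400, 100])
a2 = a2 / np.linalg.norm(a2, axis=1, keepdims=True)  # unit-norm rows
b2 = b2 / np.linalg.norm(b2, axis=1, keepdims=True)
sims = a2 @ b2.T  # shape (30000, 400)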
Replace the random a and b with the vectorisation code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()              # lowercases the text by default
vectorizer.fit(list(X) + list(Y))           # fit one shared vocabulary over both sets
a = vectorizer.transform(X).toarray()       # shape (len(X), vocabulary size)
b = vectorizer.transform(Y).toarray()       # shape (len(Y), vocabulary size)
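If memory becomes an issue (count vectors can get very wide), one option is to keep everything sparse and let scikit-learn do the L2 normalisation; sklearn.preprocessing.normalize accepts sparse matrices directly:
from sklearn.preprocessing import normalize
a_sparse = normalize(vectorizer.transform(X))  # unit-norm rows, still sparse
b_sparse = normalize(vectorizer.transform(Y))
sims = (a_sparse @ b_sparse.T).toarray()       # shape (len(X), len(Y))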
Also, at the end, you need to use the X indices to get back to the actual source strings:
x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
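If you also want to know which entry of Y each item matched, keep both index arrays from np.where:
x_idx, y_idx = np.where(sims > 0.9)
matches = [(X[i], Y[j], sims[i, j]) for i, j in zip(x_idx, y_idx)]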
If you have access to an Nvidia GPU with CUDA support, you can use it for faster, parallelised tensor operations. You can use torch to access the device:
import torch
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()              # lowercases the text by default
vectorizer.fit(list(X) + list(Y))           # fit one shared vocabulary over both sets
a = vectorizer.transform(X).toarray()
b = vectorizer.transform(Y).toarray()
# normalise the vectors and move them to the GPU as tensors
a = torch.tensor(np.array([[_v/np.linalg.norm(_v)] for _v in a]), device='cuda')  # shape (30000, 1, d)
b = torch.tensor(np.array([[_v/np.linalg.norm(_v)] for _v in b]), device='cuda')  # shape (400, 1, d)
sims = torch.tensordot(a, b, dims=([1, 2], [1, 2])).cpu().numpy()  # shape (30000, 400)
x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
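One small robustness tweak (not required for the answer): let torch pick the device at runtime, so the same script also runs on machines without a CUDA GPU.
# fall back to the CPU when no CUDA device is present
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# then pass device=device instead of device='cuda' when building the tensors above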