
I have two large lists containing text: X = [30,000 entries] and Y = [400 entries].

I want to find the texts that are similar across both lists using cosine similarity. Below is the code I am trying to execute, using nested for loops:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer()
found_words = []
for x in X:
    for y in Y:
        # vectorize the current pair of texts into count vectors
        vectors = vectorizer.fit_transform([x.lower(), y.lower()])
        sim = cosine_similarity(vectors[0], vectors[1])[0, 0]
        if sim > 0.9:
            found_words.append(x.capitalize())

The above code works fine but takes a lot of time to execute. Is there another way that is more efficient in both time and space complexity? Thank you.

1 Answer


Instead of calling cosine similarity pair by pair, you can compute the dot products of normalised vectors: the cosine of two vectors is exactly the dot product of their unit-normalised forms. The vectorisation can then be done once, before this operation.
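
For a single pair the two computations agree; here is a quick sanity check with small made-up vectors (the values are only for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[0.0, 1.0, 4.0]])

# cosine similarity as computed pair by pair in the original code
sk = cosine_similarity(x, y)[0, 0]

# the same value from the dot product of the unit-normalised vectors
dp = ((x / np.linalg.norm(x)) @ (y / np.linalg.norm(y)).T)[0, 0]

print(np.isclose(sk, dp))  # True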

Here is my attempt to replicate the test with random vectors:

import numpy as np 

# assume vector dimension is 100:
a = np.random.random([30000, 100]) # X vectors
b = np.random.random([400, 100]) # Y vectors

# normalise each vector to unit length
a = a / np.linalg.norm(a, axis=1, keepdims=True)  # shape (30000, 100)
b = b / np.linalg.norm(b, axis=1, keepdims=True)  # shape (400, 100)

sims = np.tensordot(a, b, axes=([1], [1]))  # pairwise dot products, shape (30000, 400)

print(np.where(sims > 0.87)[0])  # indices of matched items in X

I reduced the threshold to 0.87 to be able to show some results in my random vectors.

Replace the random a and b with the vectorisation code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])  # shared vocabulary for both lists
a = vectorizer.transform([s.lower() for s in X]).toarray()
b = vectorizer.transform([s.lower() for s in Y]).toarray()

Also, at the end, you need to use the X indices to map back to the original texts:

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices)) # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
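
A note on space: CountVectorizer already returns sparse matrices, and the same normalised dot product works on them directly, without converting to dense arrays; a minimal sketch, assuming sklearn.preprocessing.normalize for the row normalisation:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])
a = normalize(vectorizer.transform([s.lower() for s in X]))  # sparse rows, unit length
b = normalize(vectorizer.transform([s.lower() for s in Y]))

sims = (a @ b.T).toarray()  # cosine similarities, shape (30000, 400)
x_indices = set(np.where(sims > 0.9)[0].tolist())
found_words = [X[i] for i in x_indices]

This keeps the 30,000 x vocabulary matrix sparse; only the final (30000, 400) similarity matrix is dense.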

If you have access to an Nvidia GPU with CUDA support, you can use it for faster, parallelised tensor operations. You can use torch to access the device:

import torch
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit([s.lower() for s in X + Y])  # shared vocabulary for both lists
a = vectorizer.transform([s.lower() for s in X]).toarray()
b = vectorizer.transform([s.lower() for s in Y]).toarray()

# normalise the vectors and also convert them to tensors on the GPU
a = torch.tensor(a / np.linalg.norm(a, axis=1, keepdims=True), device='cuda')  # shape (30000, d)
b = torch.tensor(b / np.linalg.norm(b, axis=1, keepdims=True), device='cuda')  # shape (400, d)

sims = torch.tensordot(a, b, dims=([1], [1])).cpu().numpy()
# shape (30000, 400)

x_indices, _ = np.where(sims > 0.9)
x_indices = set(list(x_indices))  # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
Mehdi