I have two lists of strings, A and B. For each string in A, I'd like to compare it to every string in B and select the most similar match. The comparison function that I'm using is a custom cosine similarity measure that I found on this question. Here is how it works:
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    # lowercase, strip punctuation, tokenize, then stem
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    # fit TF-IDF on just this pair and take the off-diagonal entry of the similarity matrix
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0, 1]
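To give a feel for what this returns, here's a quick check on a couple of toy strings I made up (not from the real data) -- identical texts score about 1.0 and partial overlaps land somewhere in between:

print(cosine_sim("the quick brown fox", "the quick brown fox"))  # ~1.0, identical texts
print(cosine_sim("the quick brown fox", "a quick brown dog"))    # partial overlap, somewhere between 0 and 1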
My issue is that with somewhat long lists (500-1000 items), the execution starts to take five or ten minutes. Here's an example using some dummy text:
import requests

url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text

# split the text in half, then chop each half into 100-character chunks
A, B = sample[:int(len(sample)/2)], sample[int(len(sample)/2):]
A, B = list(map(''.join, zip(*[iter(A)]*100))), list(map(''.join, zip(*[iter(B)]*100)))
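As a quick sanity check (the exact counts depend on the length of the sample file):

print(len(A), len(B))        # roughly 500 chunks each for this sample
print(len(A[0]), len(B[0]))  # 100, 100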
Now that I have two lists, each with roughly 500 strings of 100 characters, I compute the similarities and keep the top match for each: take a string from A, score it against every string in B, sort by cosine_sim score, keep the last (highest-scoring) element, and repeat for every element of A:
matches = [(a, list(sorted([[b, cosine_sim(a, b)]
                            for b in B], key=lambda x: x[1]))[-1])
           for a in A]
The output is a list of matches, where each item contains both strings along with their calculated similarity score. That final line took 7 minutes to run, though. I'm wondering whether there are inefficiencies in my process that are slowing it down, or whether there's just a lot to compute (500 * 500 = 250,000 comparisons, plus sorting for the best match 500 times)?
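My main suspicion is the fit_transform call inside the loop, since it rebuilds the TF-IDF model for every pair. For context, here is a rough, untested sketch of what I think a vectorize-once version would look like (fitting on A + B together and doing a single sparse matrix product; note the IDF weights would then come from the whole corpus, so the scores wouldn't exactly match cosine_sim):

# Sketch only: fit the vectorizer once on everything, then one sparse product.
# Because IDF is computed over the combined corpus, the numbers will differ
# slightly from cosine_sim(), which refits on each pair.
tfidf_all = vectorizer.fit_transform(A + B)
tfidf_A, tfidf_B = tfidf_all[:len(A)], tfidf_all[len(A):]

sims = (tfidf_A * tfidf_B.T).toarray()   # len(A) x len(B) matrix of cosine similarities
best = sims.argmax(axis=1)               # index of the best match in B for each string in A
matches_fast = [(a, (B[j], sims[i, j])) for i, (a, j) in enumerate(zip(A, best))]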