
I have two lists of strings, A and B. For each string in A, I'd like to compare it to every string in B and select the most similar match. The comparison function that I'm using is a custom cosine similarity measure that I found on this question. Here is how it works:

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    # fit tf-idf on just these two texts, then take their off-diagonal cosine similarity
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0,1]
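
For instance, calling it on a pair of short strings (just illustrative text, not my actual data) returns a single score:

cosine_sim("The quick brown fox jumps", "A quick brown dog jumps")
# gives a float between 0 and 1, where higher means more similar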

My issue is that with somewhat long lists (500-1000 items), the execution starts to take five or ten minutes. Here's an example using some dummy text:

import requests    
url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text
A, B = sample[:int(len(sample)/2)], sample[int(len(sample)/2):]  # split the text into two halves
A, B = list(map(''.join, zip(*[iter(A)]*100))), list(map(''.join, zip(*[iter(B)]*100)))  # chunk each half into 100-character strings

Now that I have two lists, each with ~500 strings (of 100 characters each), I compute the similarities and take the top one. This is done by taking a string from A, iterating through B, sorting by cosine_sim score, taking the last (highest-scoring) element, and then repeating for every element in A:

matches = [(a, list(sorted([[b, cosine_sim(a, b)] 
                            for b in B], key=lambda x: x[1]))[-1])
           for a in A]

The output is a list of matches where each item contains both strings as well as their calculated similarity score. That final line took 7 minutes to run, though. I'm wondering if there are inefficiencies in my process that are slowing it down, or if there's simply a lot to compute (500*500 = 250,000 comparisons, plus sorting 500 times to pick the best match)?
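
For what it's worth, I realize the sort isn't strictly necessary; something like the snippet below (same cosine_sim calls, just taking the max directly) avoids building and sorting a full list for every element of A, though I suspect the cosine_sim calls themselves are the real cost:

matches = [(a, max(((b, cosine_sim(a, b)) for b in B), key=lambda x: x[1]))
           for a in A]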

Walker Harrison
  • Are you computing tfidf for each pair of documents? Aside from being extremely inefficient, that doesn't make sense, you should compute TFidf across the whole corpus, **then do pairwise similarity with cosine** – juanpa.arrivillaga Aug 22 '17 at 17:01
  • Also, are you just arbitrarily chunking your text-file into 100-length chunks? – juanpa.arrivillaga Aug 22 '17 at 17:08
  • So to your first question, I was thinking that since I was comparing each individual sentence to one another, TFidf would be applied individually. But maybe I can isolate pairwise sentences after? To your second Q, yes I was just chunking it arbitrarily, but that was just to create an example list of strings. For the actual data I have true sentences that have been divided correctly. – Walker Harrison Aug 22 '17 at 17:16
  • No, that doesn't make any sense. TFIDF works across a whole corpus, that is what it is designed to do. That's what makes it effective. Calculating Tfidf for each pair of documents *and then calculating cosine similarity based on that tfidf*, and *then comparing distances*, isn't really valid. And it's extremely inefficient. – juanpa.arrivillaga Aug 22 '17 at 17:18
  • Okay thank you. Do you know if there's a way to make sentence-to-sentence comparisons after running TFIDF on both texts in their entirety, or do I need to find a new way to compare (short) strings? – Walker Harrison Aug 22 '17 at 17:24
  • I'm not sure if I understand your question. First of all, you should be using `sklearn.metrics.pairwise.pairwise_distances` with `metric=cosine`, this will be faster than anything you cook up yourself in vanilla python... – juanpa.arrivillaga Aug 22 '17 at 17:26
  • I've added an answer giving you the gist of the approach I would take here... – juanpa.arrivillaga Aug 22 '17 at 17:47

1 Answer


The biggest issue is probably that you are computing tf-idf for every pair of documents (document here merely meaning your unit of text - this could be a tweet, a sentence, a scientific paper, or a book). Also, you shouldn't cook up your own similarity measure when one already exists. Finally, sklearn has a pairwise_distances routine that does what you want and is optimized. Putting this all together, here is a sample script:

import requests
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances

url = 'https://gist.githubusercontent.com/WalkerHarrison/940c005aa23386a69282f373f6160221/raw/6537d999b9e39d62df3784d2d847d4a6b2602876/sample.txt'
sample = requests.get(url).text.split('\n\n') # I'm splitting the document by "paragraphs" since it is unclear what you actually want

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')


doc_vectors = vectorizer.fit_transform(sample)                # tf-idf fitted on the whole corpus at once
distances = pairwise_distances(doc_vectors, metric='cosine')  # dense matrix of pairwise cosine distances


row_idx = list(enumerate(distances.argmax(axis=1)))  # for each row, the column with the largest distance
sorted_pairs = sorted(row_idx, key=lambda x: distances[x[0], x[1]], reverse=True)  # order those pairs by distance, largest first

# most similar documents:

i1, i2 = sorted_pairs[0] # index of most similar pair

print(sample[i1])
print("=="*50)
print(sample[i2])

There were 99 documents in my sample list, and this ran pretty much instantaneously after the download was complete. Also, the output:

Art party taxidermy locavore 3 wolf moon occupy. Tote bag twee tacos listicle, butcher single-origin coffee raclette gentrify raw denim helvetica kale chips shaman williamsburg man braid. Poke normcore lomo health goth waistcoat kogi. Af next level banh mi, deep v locavore asymmetrical snackwave chillwave. Subway tile viral flexitarian pok pok vegan, cardigan health goth venmo artisan. Iceland next level twee adaptogen, dreamcatcher paleo lyft. Selfies shoreditch microdosing vape, knausgaard hot chicken pitchfork typewriter polaroid lyft skateboard ethical distillery. Farm-to-table blue bottle yr artisan wolf try-hard vegan paleo knausgaard deep v salvia ugh offal snackwave. Succulents taxidermy cornhole wayfarers butcher, street art polaroid jean shorts williamsburg la croix tumblr raw denim. Hot chicken health goth taiyaki truffaut pop-up iceland shoreditch fingerstache.

====================================================================================================

Organic microdosing keytar thundercats chambray, cray raclette. Seitan irony raclette chia, cornhole YOLO stumptown. Gluten-free palo santo beard chia. Whatever bushwick stumptown seitan cred quinoa. Small batch selfies portland, cardigan you probably haven't heard of them shabby chic yr four dollar toast flexitarian palo santo beard offal migas. Kinfolk pour-over glossier, hammock poutine pinterest coloring book kitsch adaptogen wayfarers +1 tattooed lomo yuccie vice. Plaid fixie portland, letterpress knausgaard sartorial live-edge. Austin adaptogen YOLO cloud bread wayfarers cliche hammock banjo. Sustainable organic air plant mustache.
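
If you specifically need the best match in B for each string in A (the two-list setup in the question), the same idea applies: fit the vectorizer on the combined corpus, transform each list separately, and compare the two matrices in one shot. A minimal sketch reusing the vectorizer defined above (the A and B here are just placeholder strings):

from sklearn.metrics.pairwise import cosine_similarity

A = ["first example string", "second example string"]  # placeholder data
B = ["candidate string one", "candidate string two"]   # placeholder data

vectorizer.fit(A + B)             # learn vocabulary and idf weights over everything
A_vecs = vectorizer.transform(A)
B_vecs = vectorizer.transform(B)

sims = cosine_similarity(A_vecs, B_vecs)  # shape (len(A), len(B))
best = sims.argmax(axis=1)                # best column in B for every row of A

matches = [(a, B[j], sims[i, j]) for i, (a, j) in enumerate(zip(A, best))]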

juanpa.arrivillaga
  • Thanks for this, starting to make more sense now. So if I replace `'\n\n'` with `'.'` I should get the same idea applied to sentences as opposed to paragraphs? – Walker Harrison Aug 22 '17 at 18:04
  • @WalkerHarrison well, only if `.` is a correct sentence delimiter, which is not always the case. But yes, basically. – juanpa.arrivillaga Aug 22 '17 at 18:05
  • so last thing (I think): since we're using `argmax` I think we're actually finding the least similar docs. So we could either take the `argmin` or the `argmax` of the complement. And one last twist: since each doc is its own closest comparison, cells that are 0 (or 1 for the complement) need to be changed or ignored, I think. – Walker Harrison Aug 22 '17 at 19:10
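
Folding both of those last points into the script above, a rough sketch might look like the following (assuming sentence-level units are what's wanted; it reuses url, vectorizer, and pairwise_distances from the answer):

import numpy as np
import requests
import nltk

nltk.download('punkt')  # sent_tokenize needs the punkt models

sentences = nltk.sent_tokenize(requests.get(url).text)  # proper sentence splitting instead of '\n\n' or '.'

doc_vectors = vectorizer.fit_transform(sentences)
distances = pairwise_distances(doc_vectors, metric='cosine')

similarities = 1 - distances         # cosine similarity is 1 minus cosine distance
np.fill_diagonal(similarities, -1)   # so a sentence can't be its own best match

closest = similarities.argmax(axis=1)    # most similar other sentence for each row
i1 = similarities.max(axis=1).argmax()   # row with the strongest match overall
i2 = closest[i1]

print(sentences[i1])
print("==" * 50)
print(sentences[i2])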