I am attempting to build a list of similar sentences across a collection of 10 or so documents. I am using the FuzzyWuzzy library in Python to determine similarity, and although my current algorithm works, it is not very efficient and takes forever to run.
# count and pairs are initialised earlier; similarity() is my FuzzyWuzzy wrapper
for doc in docs:
    for sentence in doc.sentences:
        if len(sentence) > 8:  # ignore very short sentences
            for document in docs:
                # a whole-document ratio of 100 means it's the same document, so skip it
                if similarity(document, doc)["ratio"] < 100:
                    for sentn in document.sentences:
                        if len(sentn) > 8:
                            simil = similarity(sentence, sentn)
                            if simil["ratio"] > 60:  # keep reasonably similar pairs
                                count += 1
                                print(count)
                                pairs.append([sentence, sentn, simil])
In case you don't feel like reading that mess of code: it takes each document in the list, iterates over each of its sentences, and then compares that sentence to every sentence in every other document. That means it processes an enormous number of sentence pairs, many of them with similarities of less than 5%, which is terribly inefficient and a waste of processing power. Is there a more efficient algorithm or way to process the documents?
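To give a rough sense of scale (the sentence counts below are made-up for illustration, not measured from my actual corpus), the work grows quadratically with the total number of sentences:

# Back-of-the-envelope comparison count, using hypothetical sentence counts.
num_docs = 10
sentences_per_doc = 500                      # made-up average
total_sentences = num_docs * sentences_per_doc

# Each sentence is compared against every sentence in every *other* document.
comparisons = total_sentences * (total_sentences - sentences_per_doc)
print(comparisons)                           # 22,500,000 for these made-up numbers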
EDIT:
At Stark's suggestion I added this length check before computing the similarity:
if abs(len(sentence) - len(sentn)) < 10:
    simil = similarity(sentence, sentn)
    ...
There is a marked performance increase, but I still can't help feeling that the algorithm is inefficient.
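For reference, this is roughly how the innermost loop reads now with that guard in place, sitting inside the same outer loops as above (same similarity() helper and thresholds as before):

# Skip pairs whose lengths differ too much; the idea behind Stark's guard is that
# sentences of very different lengths can't score a high ratio anyway.
for sentn in document.sentences:
    if len(sentn) > 8 and abs(len(sentence) - len(sentn)) < 10:
        simil = similarity(sentence, sentn)
        if simil["ratio"] > 60:
            count += 1
            pairs.append([sentence, sentn, simil])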
Note: This is not a duplicate question. The other question asks how to determine whether two sentences are similar; I can already do that. What I need to know is how to do it efficiently, many times over.