
I am attempting to build a list of similar sentences across a collection of 10 or so documents. I am using the FuzzyWuzzy library in Python to determine similarity, and although my current algorithm works, it is not very efficient and takes forever to run.

for doc in docs:
    for sentence in doc.sentences:
        if len(sentence) > 8:
            for document in docs:
                if similarity(document,doc)["ratio"] < 100:
                    for sentn in document.sentences:
                        if len(sentn) > 8:
                            simil = similarity(sentence,sentn)
                            if simil["ratio"] > 60:
                                count+=1
                                print count
                                pairs.append([sentence,sentn,simil])

In case you don't feel like reading that mess of code: it takes each document in the list, iterates over each sentence in it, and then compares that sentence to every other sentence in every other document. That means it is processing billions of possible combinations, many of them with similarities of less than 5%, which is terribly inefficient and a waste of processing power. Is there a more efficient algorithm or way to process the documents?

EDIT:

At stark's suggestion I added this line of code:

if abs(len(sentence)-len(sentn))<10:
    simil = similarity(sentence,sentn)
    ...

There is a marked performance increase, but I still can't help feeling the algorithm is inefficient.

Note: This is not a duplicate question. The other question asks how to figure out whether two sentences are similar; I can already do that. What I need to know is how to do it efficiently, many times over.

Devon M
  • You need some metric for sentences and only compare sentences that have a metric within a range. Could be "max word length" or "total number of x's" but should be based on some characteristic that has meaning for the types of comparisons you are doing. – stark Sep 01 '15 at 18:52
  • @stark I'm not entirely clear on what you're saying, do you mean I should check each sentence against a standard before running `similarity()`? Like check to make sure the sentences are of similar length? – Devon M Sep 01 '15 at 18:59
  • possible duplicate of [How to calculate cosine similarity given 2 sentence strings? - Python](http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python) – Jim Mischel Sep 01 '15 at 19:02
  • Should be absolute value < N to find similar lengths. – stark Sep 01 '15 at 19:08
  • You might consider shingling: https://en.wikipedia.org/wiki/W-shingling. There's even a Python package: https://pypi.python.org/pypi/shingling/. How well it will work for you is unknown. You haven't really defined what "similarity" means in your case. Same words in the same order? Same words in different order? Same "meaning?" Same characters but different order, etc. – Jim Mischel Sep 01 '15 at 20:17
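A rough sketch of stark's metric idea from the comments above: bucket sentences by a cheap metric (length, in this case) and only run the fuzzy comparison on sentences from the same or the neighbouring bucket. The `docs` objects and the `similarity()` helper are taken from the question; the `bucketed_pairs` name, the bucket size, and the thresholds are made up for illustration, and the code is an untested sketch.

from collections import defaultdict

def bucketed_pairs(docs, bucket_size=10, min_len=8, threshold=60):
    # Group sentences by length bucket so the expensive fuzzy match
    # only runs on sentences of roughly similar length.
    buckets = defaultdict(list)
    for doc in docs:
        for sentence in doc.sentences:
            if len(sentence) > min_len:
                buckets[len(sentence) // bucket_size].append((doc, sentence))

    pairs = []
    for key, group in buckets.items():
        for i, (doc_a, sent_a) in enumerate(group):
            # later sentences in this bucket plus everything in the next bucket,
            # so each pair is only considered once
            for doc_b, sent_b in group[i + 1:] + buckets.get(key + 1, []):
                if doc_a is doc_b:
                    continue  # only interested in matches across different documents
                simil = similarity(sent_a, sent_b)
                if simil["ratio"] > threshold:
                    pairs.append([sent_a, sent_b, simil])
    return pairs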

1 Answer


There are at least two issues with that loop that are causing major bottlenecks.

First, you are taking the first sentence from your first document and checking it against every sentence of every document, including itself. So instead of

for doc in docs:
    for sentence in doc.sentences:
        if len(sentence) > 8:
            for document in docs:

try

for doc in docs:
    for document in docs:
        for sentence in doc.sentences:
            if len(sentence) > 8:

Second,

if similarity(document,doc)["ratio"] < 100:

is not very efficient. You don't need fuzzy matching to tell whether two documents are identical; you can just use

if document.text == doc.text:
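
Putting both changes together with the length pre-filter from the edit, the reworked loop could look roughly like this (an untested sketch that assumes the same `docs`, `doc.sentences`, `document.text`, and `similarity()` from the question):

pairs = []
count = 0
for doc in docs:
    for document in docs:
        if document.text == doc.text:
            continue  # cheap identity check instead of a fuzzy ratio of 100
        for sentence in doc.sentences:
            if len(sentence) <= 8:
                continue
            for sentn in document.sentences:
                # cheap length pre-filter before the expensive fuzzy comparison
                if len(sentn) > 8 and abs(len(sentence) - len(sentn)) < 10:
                    simil = similarity(sentence, sentn)
                    if simil["ratio"] > 60:
                        count += 1
                        pairs.append([sentence, sentn, simil])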
Devon M