I am writing a function to determine the "similarity" between two sentences. I began by simply using Python's `SequenceMatcher` from `difflib`, but I got poor results for sentences in which the words were swapped.

For example:

  • The ball was hit by Carl
  • Carl hit the ball

So I wrote the following function to solve this issue:

import itertools
from difflib import SequenceMatcher

def similarity(a, b):
    # Compare case-insensitively.
    a = a.lower()
    b = b.lower()

    # Try every word ordering of a and keep the best ratio against b.
    a_tokens = a.split(" ")
    result = 0
    for ordering in itertools.permutations(a_tokens):
        a_permutation = " ".join(ordering)
        # is_stop_word is defined elsewhere (not shown here).
        ratio = SequenceMatcher(lambda w: is_stop_word(w), a_permutation, b).ratio()
        if ratio > result:
            result = ratio
    return result

The function works fine and gives better results, but I am concerned that it could take too long for large inputs, since the number of permutations grows factorially with the number of words. Any recommendations on how to improve it?
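To give a rough sense of that growth, here is a small sketch that only counts how many orderings the brute-force loop would have to compare. The helper name `permutation_count` is made up for illustration and is not part of the function above.

```python
import math

def permutation_count(sentence):
    """Number of word orderings a brute-force permutation search would try."""
    # itertools.permutations treats repeated words as distinct positions,
    # so the count is n! for a sentence of n whitespace-separated words.
    return math.factorial(len(sentence.split()))

for text in ("Carl hit the ball", "The ball was hit by Carl"):
    print(text, "->", permutation_count(text))
```

Each of those orderings also triggers a full `SequenceMatcher` comparison, so the total work grows even faster than the count alone suggests.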

  • How do you define your similarity metric? – L3viathan Nov 15 '16 at 15:01
  • 2
    `lambda w: is_stop_word(w)` is equivalent to `is_stop_word` – Billy Nov 15 '16 at 15:02
  • Your similarity rating is very vague - should the sentence *"The ball hit Carl"* be __closer__ to *"Carl hit the ball"* than *"The ball was hit by Carl"*, even though they mean opposite things? – Billy Nov 15 '16 at 15:04
  • Are you looking to compare by meaning? SequenceMatcher cannot compare for similarity by the meaning of the two sentences, it treats them only as two sequences, without meaning. If you are, check this out: [link](http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents) – Priyank Nov 15 '16 at 15:12
  • I am not trying to make a semantic comparison. I just want a string distance function that is robust against word swaps. – NMO Nov 15 '16 at 15:28
  • 1
    Try this answer? [taken from prev link](http://stackoverflow.com/a/8897648/5699807) – Priyank Nov 15 '16 at 15:36
  • "string distance function" with "robust against word swaps" seems to be a bit of an oxymoron - they're competing concepts, if not completely orthogonal. As already suggested, you need to start with a precise definition of what metric you are trying to calculate. – twalberg Nov 15 '16 at 16:02
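
One way to pin down a metric that ignores word order without enumerating permutations is to sort the words of both sentences and compare the sorted token lists. This is only a sketch under that assumed definition, not the asker's method; the name `token_sort_similarity` is hypothetical, and stop-word handling is left out because `is_stop_word` is not shown in the question.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Word-order-insensitive ratio: compare the sorted word lists."""
    # Lowercase and split on whitespace, then sort so that any two
    # sentences containing the same words yield identical sequences.
    a_tokens = sorted(a.lower().split())
    b_tokens = sorted(b.lower().split())
    # SequenceMatcher accepts any sequences of hashable items, so the
    # comparison here is word by word rather than character by character.
    return SequenceMatcher(None, a_tokens, b_tokens).ratio()

print(token_sort_similarity("The ball was hit by Carl", "Carl hit the ball"))
print(token_sort_similarity("The ball hit Carl", "Carl hit the ball"))
```

The cost is dominated by two sorts and a single comparison instead of one comparison per permutation. Note that under this definition "The ball hit Carl" and "Carl hit the ball" score as identical, which is exactly the ambiguity raised in the comments above.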
