90

What I am striving to complete is a program which reads in a file and compares each sentence against the original sentence. A sentence which is a perfect match to the original will receive a score of 1, a sentence which is the complete opposite will receive a 0, and all other fuzzy sentences will receive a grade between 0 and 1.

I am unsure which operation or library would allow me to accomplish this in Python 3.

I have included the sample text, in which Text 1 is the original and the following strings are the comparisons.

Text: Sample

Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.

Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines // Should score high point but not 1

Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines // Should score lower than text 20

Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night. // Should score lower than text 21 but NOT 0

Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats. // Should score a 0!

San Jacinto
  • 8,774
  • 5
  • 43
  • 58
jacksonstephenc
  • 901
  • 1
  • 7
  • 3
  • 4
    Seems you want to compute the [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) (or some other [edit distance](http://en.wikipedia.org/wiki/Edit_distance) metric). If you know the maximum distance, you just have to scale the scores to the range `[0,1]`. – Felix Kling Apr 30 '12 at 11:45
  • Thank you for your help @Felix Kling the difflib may be the route to go. – jacksonstephenc Apr 30 '12 at 12:32
  • @FelixKling Too bad it got deleted.... – Franck Dernoncourt Jun 07 '15 at 07:27
  • 1
    Why should string 1 and 24 get zero? They have exactly the same first sentence. The 2nd sentence in 1 is almost the same as sentence 2+3 in 24 (only difference is "not", and an extra "I was not").. Numerically they're VERY similar. Semantically they're different, but if you're asking for a computer to understand the meaning of a sentence, then you might be asking too much. – naught101 Jan 31 '18 at 23:49

5 Answers

129

There is a package called thefuzz. Install via pip:

pip install thefuzz

Simple usage:

>>> from thefuzz import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
    97

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of different matching methods (like token-order insensitivity and partial string matching) which make it more powerful in practice. The process.extract functions are especially useful: they find the best matching strings and ratios from a set. Note that the scores are integers from 0 to 100, so divide by 100 to get the 0-to-1 range asked for. From their readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100

Process

>>> from thefuzz import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)
congusbongus
  • 13,359
  • 7
  • 71
  • 99
  • Trying out fuzzywuzzy. Have found if change "New York Giants" to, say, "New York Giants Dallas Cowboys", process.extract("new york jets", choices, limit=2) produces [('New York Jets', 100), ('New York Giants Dallas Cowboys', 86)]. Do you know why the match rate for the second fuzzy match goes up? It doesn't make much sense. – ToonZ Aug 16 '16 at 17:17
  • got warning, "lib/python2.7/site-packages/fuzzywuzzy/fuzz.py:35: UserWarning: *Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')*" –  Sep 01 '17 at 15:23
  • 1
    @user2738183 `pip install python-Levenshtein` FuzzyWuzzy uses difflib, which is part of the standard library. To getter better performance, however, you can install the python-Levenshtein module for sequence matching per above. https://pypi.org/project/python-Levenshtein/ – Alexander Dec 19 '18 at 20:38
  • Be aware that `python-Levenshtein` and `fuzzywuzzy` both are GPL. This means you have to open-source your code under the same licence. Thus this is not suitable for production code. – Thomas Deniffel May 15 '23 at 15:47
110

There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you want.

Small example from Python prompt:

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
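The same `ratio()` call can grade every candidate against the original and sort them, which is essentially the scoring scheme the question asks for. A minimal sketch using the question's sample texts (the `candidates` dict and label names are just illustrative):

```python
from difflib import SequenceMatcher

original = ("It was a dark and stormy night. I was all alone sitting on a "
            "red chair. I was not completely alone as I had three cats.")

candidates = {
    "Text 20": "It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines",
    "Text 21": "It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines",
    "Text 22": "I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.",
}

# ratio() already returns a float in [0, 1], so no rescaling is needed
scores = {name: SequenceMatcher(None, original, text).ratio()
          for name, text in candidates.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Note that this only measures surface similarity: sentence reordering (Text 22) is penalised, but semantic opposites (Text 24) will not score 0.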
mac
  • 42,153
  • 26
  • 121
  • 131
  • 1
    Thank you @mac. This is exactly what I am looking for. I am just having trouble figuring out a way I can get the program to tell if a sentence is the opposite of the original. Does python have a library in which will help me with grammatical situations to solve this problem? – jacksonstephenc Apr 30 '12 at 12:30
  • @user1365664 - That sounds like a really hard problem, but the canonical answer to anything regarding natural languages in Python is NLTK, http://www.nltk.org/. – AKX Apr 30 '12 at 12:34
  • @user1365664: I doubt such a thing exists, unless you specify what "opposite" means in this case. Is `ab` the opposite of `zy`? Or is `ab` the opposite of `ba`? etc. – Felix Kling Apr 30 '12 at 12:34
  • Well in this case..."I was all alone" and "I was not alone" are opposite sentences or strings. A positive to a negative connotation. – jacksonstephenc Apr 30 '12 at 12:41
  • @user1365664 - I do concur with the other commentators! – mac Apr 30 '12 at 12:41
  • 1
    @user1365664: In this case you might want to have a look at NLP and *sentiment analysis*. – Felix Kling Apr 30 '12 at 13:07
  • Oh thanks, I will look at the information which was provided. – jacksonstephenc Apr 30 '12 at 13:41
  • 6
    @user1365664 Don't forget to accept this answer if you are satisfied. – floer32 Apr 30 '12 at 13:57
  • I have done some experiments comparing both SequenceMatcher and fuzzywuzzy. The results are almost same. Is there any difference in algorithms used by them? – user Jun 16 '16 at 15:28
  • @user - As the asnwer about fuzzywuzzy says, fuzzyfuzzy is based on difflib, see that answer for more details. Incidentally: this answer predates fuzzywuzzy becoming a popular package by about 3 years! ;) – mac Jun 17 '16 at 21:59
17

fuzzyset is much faster than fuzzywuzzy (which wraps difflib) for both indexing and searching.

from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: Be careful not to mix unicode and bytes in your fuzzyset.

hobs
  • 18,473
  • 10
  • 83
  • 106
9

The task is called Paraphrase Identification, which is an active area of research in Natural Language Processing. I have linked several state-of-the-art papers, for many of which you can find open-source code on GitHub.

Note that all the existing answers assume that there is some string/surface similarity between the two sentences, while in reality two sentences with little string similarity can be semantically similar.

If you're interested in that kind of similarity you can use Skip-Thoughts. Install the software according to the GitHub guides and go to paraphrase detection section in readme:

import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)

This converts your sentences (X_sentences) to vectors. Later you can find the similarity of two vectors by:

import scipy.spatial.distance
similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])

where we are assuming vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], whose similarity score you wanted to find.

There are other models to convert a sentence to a vector which you can find here.

Once you convert your sentences into vectors the similarity is just a matter of finding the Cosine similarity between those vectors.
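Whichever encoder produces the vectors, the final similarity step is the same arithmetic. A minimal sketch with small hypothetical vectors (a real sentence encoder returns hundreds or thousands of dimensions, but the formula is unchanged):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical sentence vectors for two semantically similar sentences
v1 = [0.2, 0.8, 0.5]
v2 = [0.1, 0.9, 0.4]
print(cosine_similarity(v1, v2))  # close to 1.0 for similar sentences
```

Identical vectors score 1.0 and orthogonal (unrelated) vectors score 0.0, so the result already lands in the 0-to-1 range the question asks for (assuming the encoder produces non-negative similarities for your data).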

Update in 2020: Google has released a new model called BERT, built on a deep learning framework called TensorFlow, and there is an implementation that many people find easier to use called Transformers. These models accept two phrases or sentences and can be trained to say whether the two have the same meaning or not. To train them, you need a number of sentence pairs labelled 1 or 0 (same meaning or not). You train these models on your labelled data, and then you'll be able to use the trained model to make predictions for a new pair of phrases/sentences. You can find how to train (they call it fine-tuning) these models on their corresponding GitHub pages or in many other places such as this.

There is also already-labelled training data available in English, called MRPC (Microsoft Research Paraphrase Corpus). Note that multilingual and language-specific versions of BERT also exist, so this model can be extended (i.e. trained) in other languages as well.

Ash
  • 3,428
  • 1
  • 34
  • 44
2

There is also RapidFuzz, a fast and accurate fuzzy comparison library under the MIT licence: https://github.com/maxbachmann/RapidFuzz

Outcast
  • 4,967
  • 5
  • 44
  • 99