9

fuzzywuzzy is a very popular library for string matching. As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as the contributors write in the documentation?

prashanth
  • 4,197
  • 4
  • 25
  • 42
  • 1
    Documentation may be outdated .. surely the best place to ask is by creating an issue on the github site. – DisappointedByUnaccountableMod Dec 30 '18 at 18:24
  • I'd suggest check other implementation of Levenshtein and Ratcliff/Obershelp algorithms, checkout [textdistance](https://theautomatic.net/2019/11/13/guide-to-fuzzy-matching-with-python/) and some of the examples. – Ibrahim.H Apr 13 '23 at 15:06

2 Answers2

6

FuzzyWuzzy.ratio using python-Levenshtein doesn't return the Levenshtein score, but rather the Levenshtein ratio, which is (a+b - LevenshteinScore)/(a+b), where a and b are the lengths of the two strings being compared.

If you don't have python-Levenshtein installed then fuzzywuzzy doesn't use Levenshtein at all. Fuzzywuzzy's home page is misleading with regards to this, though it does recommend installing python-Levenshtein.

python-Levenshtein has some issues with installing; I used the second response to this stackoverflow question to solve it.

If you don't have python-Levenshtein installed FuzzyWuzzy uses difflib instead, which is the same for many input values, but not all. The developers recommend using python-Levenshtein. See this issue on fuzzywuzzy's git, which includes an example case where the results are different with the package as compared to without it. This probably shouldn't happen, or at least the documentation should make this clear, but FuzzyWuzzy's Devs seem content at least with the functionality.

Isaac
  • 361
  • 5
  • 18
  • 1
    I too have the same opinion that the home page is misleading. Also I tried finding the matching score with and without python-Levenshtein package, the result is same. May be the package only speeds up the calculation. The results are same. – prashanth Aug 07 '19 at 03:57
  • Check my final link - it provides an example where the results are different. I've verified that on my own system. – Isaac Aug 07 '19 at 09:47
  • 1
    just wondering why the results should be different. Does it mean that the scores will be completely different with and without the package. Usually this should not be the case. – prashanth Aug 08 '19 at 04:00
  • 1
    Completely different is maybe an overstatement. There are situations where the two are different. I agree however that this shouldn't happen, but FuzzyWuzzy's Devs seem content; they closed this issue as if it was solved without altering anything. – Isaac Aug 08 '19 at 07:37
0

Found an excellent article from the creator of FuzzyWuzzy here.

String Similarity The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar: NEW YORK METS NEW YORK MEATS Looks like a harmless misspelling. Can we quantify it? Using python’s difflib, that’s pretty easy

from difflib import SequenceMatcher 
m = SequenceMatcher(None,"NEW YORK METS", "NEW YORK MEATS") 
m.ratio() ⇒ 0.962962962963 

So it looks like these two strings are about 96% the same. Pretty good! We use this pattern so frequently, we wrote a helper method to encapsulate it

fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
jumping_monkey
  • 5,941
  • 2
  • 43
  • 58