2

I have the following two strings:

a = 'bjork gudmundsdottir'
b = 'b. gudmundsson gunnar'

The Levenshtein distance between the two is 12. When I use the following formula for Levenshtein distance, I get a discrepancy of 0.01 with the python-Levenshtein library:

>>> Ldist / max(len( a ), len( b ))
>>> float(12)/21
0.5714285714285714
# python-Levenshtein
Levenshtein.ratio(a,b)
0.5853658536585366
# difflib
>>> seq=difflib.SequenceMatcher(a=a,b=b)
>>> seq.ratio()
0.5853658536585366

What accounts for this difference? What am I doing incorrectly in my calculation. Note that I have reviewed this How python-Levenshtein.ratio is computed similar question and it does not quite answer what I am asking.

Could someone please explain the formula that is used to calculate the above ratio?

Community
  • 1
  • 1
David542
  • 104,438
  • 178
  • 489
  • 842
  • 1
    So why didn't you update your [original question from 12min ago](http://stackoverflow.com/questions/29401904/how-is-levenshtein-distance-related-to-ratio) with details to what's still unclear? – Lukas Graf Apr 01 '15 at 22:36
  • May be you want have a look in: http://stackoverflow.com/questions/6690739/fuzzy-string-comparison-in-python-confused-with-which-library-to-use – felipsmartins Apr 01 '15 at 22:38
  • @felipsmartins -- I've also added the output from `difflib`. – David542 Apr 01 '15 at 22:40
  • possible duplicate of [String similarity metrics in Python](http://stackoverflow.com/questions/1471153/string-similarity-metrics-in-python) – Peter Wood Apr 01 '15 at 22:41
  • 1
    So did you actually look at the accepted answer for the question that was linked as a duplicate to your [first question](http://stackoverflow.com/questions/29401904/how-is-levenshtein-distance-related-to-ratio)? `ratio()` uses a different cost for replace operations than `distance()`, hence the difference. `(lensum - ldist) / lensum = (41.0 - 17.0) / 41.0 = 0.585` – Lukas Graf Apr 01 '15 at 23:04
  • @David542 since the internal `levenshtein_common` C function isn't exposed at the Python level, it's not trivial to get at an `ldist` value with a `replace` cost of 2 to do the exact same calculation yourself. But you can look at `Levenshtein.editops(a, b)` and see that there are 5 `replace` operations (`17 - 12`) to verify that that's where the discrepancy comes from. – Lukas Graf Apr 01 '15 at 23:33
  • @LukasGraf -- ok, great. I was able to figure out the issue with your help, and I've added an answer below from your insights. Thanks! – David542 Apr 01 '15 at 23:43

1 Answers1

6

From Lukas's comment, the reason for this is that the ratio() uses a cost of 2 for replace operations rather than the normal cost of 1 for the Levenshtein distance. Here's an example calculation:

a = 'bjork gudmundsdottir'
b = 'b. gudmundsson gunnar'

>>> Levenshtein.editops(a,b)
[('delete', 1, 1), ('delete', 2, 1), ('delete', 3, 1), ('replace', 4, 1), ('replace', 14, 11), ('insert', 16, 13), ('insert', 16, 14), ('insert', 16, 15), ('insert', 16, 16), ('replace', 16, 17), ('replace', 17, 18), ('replace', 18, 19)]

>>> ldist = sum([2 for item in Levenshtein.editops(a,b) if item[0] == 'replace']) 
          + sum([1 for item in Levenshtein.editops(a,b) if item[0] != 'replace']) # 17
ln = len(a) + len(b) # 41

>>> (41.0-17.0)/41.0
0.5853658536585366
>>> Levenshtein.ratio(a,b)
0.5853658536585366
David542
  • 104,438
  • 178
  • 489
  • 842