1

I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.

The Levenshtein Distance algorithm counts looks for the minimum number of edits between the two strings. That can be achieved using the addition, deletion, and substitution of a character in the string. All these operations are counted as a single operation when calculating the score.

Here are a couple of examples:

Example 1 s1 = 'hello' s2 = 'hell'

Levenshtein Score = 1 (it requires 1 edit, addition of 'o')

Example 2 s1 = 'hello' s2 = 'hella'

Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')

Plugging these scores into the Fuzzywuzzy formula (len(s1)+len(s2) - LevenshteinScore)/((len(s1)+len(s2)):

Example 1: (5+4-1)/9 = 89%

Example 2: (5+5-1)/10 = 90%

Now the fuzzywuzzy does return the same score for Example 1, but not for example 2. The score for example 2 is 80%. On investigating how it is calculating the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library but I just want to know why is it called Levenshtein Distance, when it actually is not?

I am just trying to figure out why is there a distinction here? What does it mean or explain? Basically the reason for using 2 operations for substitution rather than one as defined in Levenshtein Distance and still calling it Levenshtein Distance. Is it got something to do with the gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?

I would love if somebody could give me some insight. Also is there a better way to convert LD to a similarity score? Or in general measure the similarity between two strings? I am trying to measure the similarity between some audio file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.

Thank you!

Samarth
  • 242
  • 2
  • 12

0 Answers0