
According to the python-Levenshtein.ratio source:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722

it's computed as (lensum - ldist) / lensum. This works for

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance('ab', 'a') # returns 1
Levenshtein.ratio('ab', 'a')    # returns 0.666666

However, it seems to break with

Levenshtein.distance('ab', 'ac')  # returns 1
Levenshtein.ratio('ab', 'ac')     # returns 0.5

I feel I must be missing something very simple... but why not 0.75?

Franck Dernoncourt
cjauvin
  • I checked the library (at the link you gave) and I am also confused about why it uses `sum`. `(1 - 1/3) = 0.666...` is correct according to the code, but then `(1 - 1/4) = 0.75` — how is it 0.5? It's not clear even in the documentation. But the actual formula to calculate Levenshtein distance is in my answer. – Grijesh Chauhan Jan 10 '13 at 17:13

4 Answers


The Levenshtein distance for 'ab' and 'ac' is computed as below:

[image: edit-distance matrix for 'ab' vs 'ac']

so alignment is:

  a c
  a b 

Alignment length = 2
Number of mismatches = 1

The Levenshtein distance is 1, because only one substitution is required to transform 'ac' into 'ab' (or vice versa).

Distance ratio = (Levenshtein distance) / (alignment length) = 1/2 = 0.5

EDIT

You wrote

(lensum - ldist) / lensum = (1 - ldist/lensum) = 1 - 0.5 = 0.5.

But this is a matching score (not a distance). In this REFERENCE you may notice it is written:

Matching %

p = (1 - l/m) × 100

where l is the Levenshtein distance and m is the length of the longer of the two words:

(notice: some authors use the longer of the two; I used the alignment length)

(1 - 3/7) × 100 = 57.14...  

  Word 1    Word 2    Distance ratio    Mismatches    Match %
  AB        AB        0                 0             (1 - 0/2) × 100 = 100%
  CD        AB        1                 2             (1 - 2/2) × 100 = 0%
  AB        AC        0.5               1             (1 - 1/2) × 100 = 50%
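
The Match % column follows directly from the quoted formula; a minimal sketch (the `match_percent` helper name is mine, not from any library):

```python
def match_percent(ldist, m):
    """(1 - l/m) * 100, with l the edit distance and m the alignment length."""
    return (1 - ldist / m) * 100

assert match_percent(0, 2) == 100.0   # AB vs AB
assert match_percent(2, 2) == 0.0     # CD vs AB
assert match_percent(1, 2) == 50.0    # AB vs AC
```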

Why do some authors divide by the alignment length and others by the max length of the two strings? Because Levenshtein does not consider gaps. Distance = number of edits (insertions + deletions + replacements), while the Needleman–Wunsch algorithm, the standard global alignment, does consider gaps. This (gaps) is the difference between Needleman–Wunsch and Levenshtein, so many papers use the max length of the two sequences (BUT THIS IS MY OWN UNDERSTANDING, AND I AM NOT 100% SURE).

Here is an IEEE Transactions on Pattern Analysis paper, Computation of Normalized Edit Distance and Applications. In this paper the normalized edit distance is defined as follows:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P)/L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).

Grijesh Chauhan
  • Thanks for the answer, it makes sense, but it doesn't address the aspect that really bothers me: the two results (obtained with the same code) don't seem to be consistent with each other (i.e. they suggest two different ways of computing the ratio). How can this be? – cjauvin Jan 11 '13 at 14:38
  • @cjauvin Did you read my comment on your question? I checked, and I have the same impression: according to the documentation it should be `.75`, yet the two results in your example contradict each other. – Grijesh Chauhan Jan 11 '13 at 18:34
  • Yes I've seen your comment, and that's why, although good and interesting, I cannot accept your answer as the solution, because what I'm really after is the reason for the contradiction in this particular piece of code. Maybe I should ask the library maintainer. – cjauvin Jan 11 '13 at 19:19
  • @cjauvin By the way, I am working on it for you... I mean I am looking at that file (you linked). If I find something I will answer you back... give me some time. – Grijesh Chauhan Jan 11 '13 at 19:31
  • @cjauvin What is your email id? – Grijesh Chauhan Jan 11 '13 at 19:57
  • Email id? If you mean my email address, it should be in my profile. – cjauvin Jan 11 '13 at 20:07
  • @cjauvin Your question is valid... I have the same doubt. On your behalf I sent a mail to the code author. I wanted to add you in CC... Will answer you as I get a response. – Grijesh Chauhan Jan 11 '13 at 20:13
  • @cjauvin I asked about the doubt at the e-mail provided and got a reply: `As Github README states - I have no idea how Levenshtein works, so cannot help you. Sorry.` – Grijesh Chauhan Jan 12 '13 at 05:31
  • I discovered the answer, and posted it below. – cjauvin Jan 13 '13 at 15:11

By looking more carefully at the C code, I found that this apparent contradiction is due to the fact that ratio treats the "replace" edit operation differently than the other operations (i.e. with a cost of 2), whereas distance treats them all the same with a cost of 1.

This can be seen in the calls to the internal levenshtein_common function made within the ratio_py function:


https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727

static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) //Call
    return NULL;

  if (lensum == 0)
    return PyFloat_FromDouble(1.0);

  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}

and by the distance_py function:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715

static PyObject*
distance_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
    return NULL;

  return PyInt_FromLong((long)ldist);
}

which ultimately results in different cost arguments being sent to another internal function, lev_edit_distance, which has the following doc snippet:

@xcost: If nonzero, the replace operation has weight 2, otherwise all
        edit operations have equal weights of 1.

Code of lev_edit_distance():

/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;
  /* ... */


So in my example,

ratio('ab', 'ac') implies a replacement operation (cost of 2) over the total length of the strings (4), hence (4 - 2)/4 = 0.5.

That explains the "how"; I guess the only remaining aspect would be the "why", but for the moment I'm satisfied with this understanding.
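
The two cost conventions can be reproduced with a small pure-Python dynamic program. This is only a sketch, not the library's actual implementation; the `sub_cost` parameter plays the role of `xcost`:

```python
def edit_distance(s, t, sub_cost=1):
    """Edit distance via dynamic programming.

    sub_cost=1 matches distance(); sub_cost=2 mimics the xcost != 0
    weighting that ratio() uses internally (replace = delete + insert).
    """
    prev = list(range(len(t) + 1))   # distances from the empty prefix of s
    for i, sc in enumerate(s, 1):
        cur = [i]
        for j, tc in enumerate(t, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (0 if sc == tc else sub_cost)))
        prev = cur
    return prev[-1]

# distance() semantics: every operation costs 1
assert edit_distance('ab', 'ac') == 1
# ratio() semantics: a replacement costs 2
assert edit_distance('ab', 'ac', sub_cost=2) == 2
lensum = len('ab') + len('ac')
assert (lensum - edit_distance('ab', 'ac', sub_cost=2)) / lensum == 0.5
```

With 'ab' vs 'a' the only edit is a deletion (cost 1 under both conventions), which is why (3 - 1)/3 = 0.666... agrees with the plain distance there.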

Grijesh Chauhan
cjauvin
  • The why: a guess, but since the sum of lengths is used in the calculation of the ratio, r = (1 - d/L)*100 with d = lev.distance and L = sum of lengths, you need replacement to cost 2 to get a ratio of 0 for a complete replacement, e.g. "abcd" -> "dcba" has a ratio of 0 ( = (1 - 8/8)*100 ). – dstibbe May 28 '19 at 06:17
  • Correction: "abcd" -> "xyzw" will result in a ratio of 0; "abcd" -> "dcba" will result in 25, because the distance is 6, not 8. – dstibbe May 28 '19 at 06:42
  • Thanks for the explanation, but a few cases still confuse me. For example, lev.ratio("rogers elizabeth","elisabeth rogers") = 0.5. lensum = 32 (including whitespace), but this is 15 replacements, which should be 15*2 = 30. With ldist = 30, (32-30)/32 = 0.0625. Why does the Python lev.ratio function return 0.5? – SSS Dec 08 '22 at 15:19

(lensum - ldist) / lensum

ldist is not the distance; it is the sum of costs.

[image: edit-distance matrix]

Each number in the matrix that is not a match comes from above, from the left, or from the diagonal.

If the number comes from the left it is an insertion; if it comes from above it is a deletion; if it comes from the diagonal it is a replacement.

Insertions and deletions have cost 1, and substitutions have cost 2. The replacement cost is 2 because it is a deletion plus an insertion.

The cost for 'ab' -> 'ac' is 2 because it is a replacement.

>>> import Levenshtein as lev
>>> lev.distance("ab","ac")
1
>>> lev.ratio("ab","ac")
0.5
>>> (4.0-1.0)/4.0    # Error: the distance is 1, but the cost is 2 because it is a replacement
0.75
>>> lev.ratio("ab","a")
0.6666666666666666
>>> lev.distance("ab","a")
1
>>> (3.0-1.0)/3.0    # Coincidence: the distance equals the cost of a deletion, which is 1
0.6666666666666666
>>> x="ab"
>>> y="ac"
>>> lev.editops(x,y)
[('replace', 1, 1)]
>>> ldist = sum([2 for item in lev.editops(x,y) if item[0] == 'replace'])+ sum([1 for item in lev.editops(x,y) if item[0] != 'replace'])
>>> ldist
2
>>> ln=len(x)+len(y)
>>> ln
4
>>> (4.0-2.0)/4.0
0.5


For more information: python-Levenshtein ratio calculation

Another example:

[image: edit-distance matrix for 'look-at' vs 'google']

The cost is 9 (4 replacements => 4*2 = 8, and 1 deletion => 1*1 = 1; 8 + 1 = 9)

len1 = len("google")   # 6
len2 = len("look-at")  # 7
len1 + len2            # 13

distance = 5 (according to cell (7, 6) of the matrix)

The ratio is (13 - 9)/13 = 0.3076923076923077

>>> c="look-at"
>>> d="google"
>>> lev.editops(c,d)
[('replace', 0, 0), ('delete', 3, 3), ('replace', 4, 3), ('replace', 5, 4), ('replace', 6, 5)]
>>> lev.ratio(c,d)
0.3076923076923077
>>> lev.distance(c,d)
5
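
The weighted cost of 9 can be re-derived from the editops listed above without any library, applying cost 2 to replacements and cost 1 to everything else:

```python
# Edit operations for 'look-at' -> 'google', copied from lev.editops above
ops = [('replace', 0, 0), ('delete', 3, 3), ('replace', 4, 3),
       ('replace', 5, 4), ('replace', 6, 5)]

cost = sum(2 if op == 'replace' else 1 for op, _, _ in ops)
lensum = len('look-at') + len('google')     # 7 + 6 = 13

assert cost == 9                            # 4 replaces * 2 + 1 delete
assert (lensum - cost) / lensum == 4 / 13   # 0.3076923..., matching lev.ratio
```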
rafaelcb21

Although there's no absolute standard, normalized Levenshtein distance is most commonly defined as ldist / max(len(a), len(b)). That would yield 0.5 for both examples.

The max makes sense since it is the lowest upper bound on Levenshtein distance: to obtain a from b where len(a) > len(b), you can always substitute the first len(b) elements of b with the corresponding ones from a, then insert the missing part a[len(b):], for a total of len(a) edit operations.

This argument extends in the obvious way to the case where len(a) <= len(b). To turn normalized distance into a similarity measure, subtract it from one: 1 - ldist / max(len(a), len(b)).
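
This convention can be sketched in a couple of lines (the `normalized_distance` helper name is mine, not from any library):

```python
def normalized_distance(dist, a, b):
    """Normalized Levenshtein distance: dist / max(len(a), len(b))."""
    return dist / max(len(a), len(b))

# Both of the question's pairs have distance 1 and max length 2:
assert normalized_distance(1, 'ab', 'a') == 0.5
assert normalized_distance(1, 'ab', 'ac') == 0.5
# As a similarity measure: 1 - normalized distance
assert 1 - normalized_distance(1, 'ab', 'ac') == 0.5
```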

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • Hi larsmans! You are correct that it is most commonly defined as `ldist / max(len(a), len(b))`; the variant that considers gaps is the [Needleman–Wunsch algorithm](http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm). – Grijesh Chauhan Jan 10 '13 at 15:48