3

Based on this paper: IEEE TRANSACTIONS ON PAITERN ANALYSIS : Computation of Normalized Edit Distance and Applications In this paper Normalized Edit Distance as followed:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d( X , Y ) is defined as the minimum of W( P ) / L ( P )w, here P is an editing path between X and Y , W ( P ) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).

enter image description here

Can i safely translate the normalized edit distance algorithm explained above as this:

normalized edit distance = 
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))
jxn
  • 7,685
  • 28
  • 90
  • 172

2 Answers2

2

You are probably misunderstanding the metric. There are two issues:

  1. The normalization step is to divide W(P) which is the weight of the edit procedure over L(P), which is the length of the edit procedure, not over the max length of the strings as you did;

  2. Also, the paper showed that (Example 3.1) normalized edit distance cannot be simply computed with levenshtein distance. You probably need to implement their algorithm.

An explanation of Example 3.1 (c):

From aaab to abbb, the paper used the following transformations:

  1. match a with a;
  2. skip a in the first string;
  3. skip a in the first string;
  4. skip b in the second string;
  5. skip b in the second string;
  6. match the final bs.

These are 6 operations which is why L(P) is 6; from the matrix in (a), matching has cost 0, skipping has cost 2, thus we have total cost of 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) = 1.33. Similar results can be obtained for (b), which I'll left to you as exercise :-)

zw324
  • 26,764
  • 16
  • 85
  • 118
  • Thanks for the clarification. I don't quite understand the algorithm in the paper. Could you further clarify with an example? While the paper showed an example with 'aaab' and 'abbb' it didnt show how they arrived at W(P) =8 or L(P)=6 in the example 3.1. – jxn Jun 11 '15 at 17:20
  • @Jenn: Edited for the example. Thanks for introducing me to this paper! – zw324 Jun 11 '15 at 17:26
  • what is the definition of skip `a` or skip `b`. Would it not be replace instead?And is cost always 2 for skip and 0 for match? Thanks! – jxn Jun 11 '15 at 18:16
  • @Jenn: You can see from Fig. 2 (a), with skip operation being `a` or `b` matched with lambda. Diagonal entries are cost for match and the costs are 0 (lambda matching lambda doesn't make sense since lambda denotes empty), and skips are of cost 2. – zw324 Jun 11 '15 at 19:36
  • what does the 3 denote in fig 2(a)? – jxn Jun 11 '15 at 20:12
0

The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".

In figure 2(b), W(P) = 6 because the algorithm does the following steps:

  1. keep first a (cost 0)
  2. convert first b to a (cost 3)
  3. convert second b to a (cost 3)
  4. keep last b (cost 0)

The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).

In figure 2(c), the steps are different:

  1. keep first a (cost 0)
  2. delete first b (cost 2)
  3. delete second b (cost 2)
  4. insert a (cost 2)
  5. insert a (cost 2)
  6. keep last b (cost 0)

In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.

Rudi Cilibrasi
  • 885
  • 4
  • 12