Difference in normalization of Levenshtein (edit) distance?

Question

If the Levenshtein distance between two strings, s and t is given by L(s,t),

what is the difference in the impact on the resulting heuristic of the following two different normalization approaches?

L(s,t) / [length(s) + length(t)]
L(s,t) / max[length(s), length(t)]
(L(s,t)*2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there is some mathematical justification for using one over the other.

Also, what is the difference between approach 1 and approach 3?

With the following example:

s = "Hi, my name is"
t = "Hello, my name is"
L(s,t) = 4
length(s) = 14 # (includes white space)
length(t) = 17 # (includes white space)

The Levenshtein distance given the three normalization algorithms above are:

[Approach 1]   4  /(14+17) = 0.129
[Approach 2]   4  /(17)    = 0.235
[Approach 3] (4*2)/(14+17) = 0.258

the impact on the resulting metric and the different interpretation — user2205916, Dec 09 '16 at 18:46
Following up on the earlier comment, what are you intending to do with the normalized differences? I suspect that the answer to this question will entirely depend on that. — templatetypedef, Dec 09 '16 at 19:18
@templatetypedef Just trying to find a measure of similarity between corresponding elements of two vectors, with each distance for a given pair comparable to another pair's distance. — user2205916, Dec 09 '16 at 20:37
But what are you then doing with that similarity metric? It's hard to judge how "good" a metric will be without knowledge of how it's going to be used. — templatetypedef, Dec 09 '16 at 20:45
I think the ultimate goal is to get a distribution of the metric's values for an idea of the range of variation across pairs of strings — user2205916, Dec 09 '16 at 23:07

score 5 · Accepted Answer · edited Mar 04 '22 at 03:19

5

The effects of both variants should be nearly the same.

The second approach covers a range from 0 (strings are equal) to 1 (completely different)...
while the upper range in the first variant depends on the length of the strings: if the lengths are nearly equal the upper bound is 0.5, and increases on larger differences between the lengths.

edited Mar 04 '22 at 03:19

smci

32,567
20
113
146

answered Dec 09 '16 at 19:50

clemens

16,716
11
50
65

Difference in normalization of Levenshtein (edit) distance?

1 Answers1

Linked