6

If the Levenshtein distance between two strings, s and t is given by L(s,t),

what is the difference in the impact on the resulting heuristic of the following two different normalization approaches?

  1. L(s,t) / [length(s) + length(t)]

  2. L(s,t) / max[length(s), length(t)]

  3. (L(s,t)*2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there is some mathematical justification for using one over the other.

Also, what is the difference between approach 1 and approach 3?

With the following example:

s = "Hi, my name is"
t = "Hello, my name is"
L(s,t) = 4
length(s) = 14 # (includes white space)
length(t) = 17 # (includes white space)

The Levenshtein distance given the three normalization algorithms above are:

[Approach 1]   4  /(14+17) = 0.129
[Approach 2]   4  /(17)    = 0.235
[Approach 3] (4*2)/(14+17) = 0.258
smci
  • 32,567
  • 20
  • 113
  • 146
user2205916
  • 3,196
  • 11
  • 54
  • 82
  • 1
    The impact on what exactly? – kraskevich Dec 09 '16 at 18:33
  • the impact on the resulting metric and the different interpretation – user2205916 Dec 09 '16 at 18:46
  • 1
    Following up on the earlier comment, what are you intending to do with the normalized differences? I suspect that the answer to this question will entirely depend on that. – templatetypedef Dec 09 '16 at 19:18
  • @templatetypedef Just trying to find a measure of similarity between corresponding elements of two vectors, with each distance for a given pair comparable to another pair's distance. – user2205916 Dec 09 '16 at 20:37
  • 1
    But what are you then doing with that similarity metric? It's hard to judge how "good" a metric will be without knowledge of how it's going to be used. – templatetypedef Dec 09 '16 at 20:45
  • I think the ultimate goal is to get a distribution of the metric's values for an idea of the range of variation across pairs of strings – user2205916 Dec 09 '16 at 23:07

1 Answers1

5

The effects of both variants should be nearly the same.

  • The second approach covers a range from 0 (strings are equal) to 1 (completely different)...
  • while the upper range in the first variant depends on the length of the strings: if the lengths are nearly equal the upper bound is 0.5, and increases on larger differences between the lengths.
smci
  • 32,567
  • 20
  • 113
  • 146
clemens
  • 16,716
  • 11
  • 50
  • 65