2

I have to normalize the Levenshtein distance between 0 to 1. I see different variations floating in SO.

I am thinking to adopt the following approach:

  • if two strings, s1 and s2
  • len = max(s1.length(), s2.length());
  • normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);

Then the highest score 1.0 means an exact match and 0.0 means no match.

But I see variations here: two whole texts similarity using levenshtein distance where 1- distance(a,b)/max(a.length, b.length)

Difference in normalization of Levenshtein (edit) distance?

Explanation of normalized edit distance formula

I am wondering is there a canonical code implementation in Java? I know org.apache.commons.text only implements LevenshteinDistance and not normalized LevenshteinDistance.

https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/LevenshteinDistance.html

Exploring
  • 2,493
  • 11
  • 56
  • 97

3 Answers3

2

Your first answer begins with "The effects of both variants should be nearly the same". The reason normalized LevenshteinDistance doesn't exist is because you (or somebody else) hasn't seen fit to implement it. Besides, it seems a rather trivial once you have the Levenshtein distance:

private double normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    if (s1.length() >= s2.length()) {
        return levenshtein / s1.length();
    }
    else {
        return levenshtein / s2.length();
    }
}

After 3 days, once this has been thoroughly ripped to shreds, I'll add it as a Github issue on commons-text.

kviLL
  • 211
  • 2
  • 9
hd1
  • 33,938
  • 5
  • 80
  • 91
  • the following should work - right? ```if (s1.length() > s2.length() || (s1.length() == s2.length())) { return levenshtein/s1.length(); } else { // if (s2.length() > s1.length()) return levenshtein/s2.length(); }``` – Exploring Sep 29 '20 at 06:12
  • 5
    I'd jsut write this as `return levenshtein / Math.max(s1.length(), s2.length());`. – Joachim Sauer Sep 29 '20 at 06:24
  • 1
    I prefer being explicit, @JoachimSauer – hd1 Sep 29 '20 at 06:49
  • 1
    @Exploring that should work, I just prefer having my conditions explicitly stated. – hd1 Sep 29 '20 at 06:50
  • 1
    An important characteristic of any distance function is that it represents [a metric](https://en.wikipedia.org/wiki/Metric_(mathematics)). See https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/EditDistance.html Normalizing with max(s1.length, s2.length) will break triangle inequality. I think this is why there's no such implementation in commons-text. – COME FROM Sep 29 '20 at 08:39
  • @COMEFROM I have not understood this concept of inequality. Can you please give me an example. Thanks for your help. – Exploring Sep 29 '20 at 10:00
  • See [triange inequality](https://en.wikipedia.org/wiki/Triangle_inequality). It's one of the three characteristics of [a metric](https://en.wikipedia.org/wiki/Metric_(mathematics)). In layman's terms, a sensible notion of distance will not violate triangle inequality. Furthermore, implementations of [the EditDistance interface](https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/EditDistance.html) in the commons-text shouldn't violate triangle inequality. It's in the Javadoc. – COME FROM Sep 29 '20 at 11:36
  • 1
    @Exploring if this was a helpful answer to you, would you kindly accept it? – hd1 Sep 29 '20 at 17:26
1

It seems you need a measure of similarity rather than an actual measure of distance.

A proper measure of distance should obey the rules of metric like the Javadoc of the interface EditDistance in Commons Text says. There is a reason Commons Text does not include an implementation for normalized Levenshtein distance. It can be done properly, but I doubt the results would be useful.

However, using Levenshtein distance to define a measure of similarity like you suggested will work.

Apache Commons Text already has some implementations for measuring similarity. Perhaps JaroWinklerSimilarity would fit the bill.

I'd consider writing an implementation for the SimilarityScore interface using Levenshtein distance like you suggested. It will produce slightly different results than JaroWinklerSimilarity. Using the interface for your own implementation would allow changing it easily to any implementation provided by Commons Text. You could easily compare different algorithms.

Just make sure you don't divide with max(s1.length, s2.length) before checking it's not zero!

COME FROM
  • 2,497
  • 1
  • 14
  • 11
1

I had used a normalized edit distance or similarity (NES) which I think is very useful, defined by Daniel Lopresti and Jiangyin Zhou, in Equation (6) of their work: http://www.cse.lehigh.edu/~lopresti/Publications/1996/sdair96.pdf.

The NES in python is:

import math
def normalized_edit_similarity(m, d):
    # d : edit distance between the two strings
    # m : length of the shorter string
    return ( 1.0 / math.exp( d / (m - d) ) )

print(normalized_edit_similarity(3, 0))
print(normalized_edit_similarity(3, 1))
print(normalized_edit_similarity(4, 1))
print(normalized_edit_similarity(5, 1))
print(normalized_edit_similarity(5, 2))

1.0
0.6065306597126334
0.7165313105737893
0.7788007830714049
0.513417119032592

More examples can be found in Table 2 in the above paper.

The variable m in the above function can be replaced with the length of the longer string to fit your need.

See also: https://stackoverflow.com/a/71266201/8583170 (I have not yet familiar with how to answer similar questions with the same answer).

Sam Tseng
  • 178
  • 6