4

I have the following implementation, but I want to add a threshold, so if the result is going to be greater than it, just stop calculating and return.

How would I go about that?

EDIT: Here is my current code, threshold is not yet used...the goal is that it is used

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var cost = string1[i - 1] == string2[j - 1] ? 0 : 1;

                var del = d[i - 1, j] + 1;
                var ins = d[i, j - 1] + 1;
                var sub = d[i - 1, j - 1] + cost;

                d[i, j] = Math.Min(del, Math.Min(ins, sub));

                if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                    d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
}
CaffGeek
  • 21,856
  • 17
  • 100
  • 184
  • This answer : http://stackoverflow.com/a/9454016/461444 gives an implementation which seems to perform really really well according my own tests. – AFract Dec 26 '14 at 18:35

4 Answers4

5

This is Regarding ur answer this: Damerau - Levenshtein Distance, adding a threshold (sorry can't comment as I don't have 50 rep yet)

I think you have made an error here. You initialized:

var minDistance = threshold;

And ur update rule is:

if (d[i, j] < minDistance)
   minDistance = d[i, j];

Also, ur early exit criteria is:

if (minDistance > threshold)
   return int.MaxValue;

Now, observe that the if condition above will never hold true! You should rather initialize minDistance to int.MaxValue

Community
  • 1
  • 1
Urvang Joshi
  • 86
  • 1
  • 2
1

Here's the most elegant way I can think of. After setting each index of d, see if it exceeds your threshold. The evaluation is constant-time, so it's a drop in the bucket compared to the theoretical N^2 complexity of the overall algorithm:

public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
{
    ...

    for (var i = 1; i <= d.GetUpperBound(0); i++)
    {
        for (var j = 1; j <= d.GetUpperBound(1); j++)
        {
            ...

            var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub));

            if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1])
                temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost);

            //Does this value exceed your threshold? if so, get out now
            if(temp > threshold) 
              return temp;
        }
    }

    return d[d.GetUpperBound(0), d.GetUpperBound(1)];
}
CaffGeek
  • 21,856
  • 17
  • 100
  • 184
KeithS
  • 70,210
  • 21
  • 112
  • 164
  • almost! I tweaked it to get it working, `d[i,j]` had to be set, for some reason, so I just had temp get set at the same time, then checked, and it now works perfectly! Thanks! – CaffGeek Oct 01 '10 at 20:15
  • I was mistaken, this isn't working... even if the result should have been 1, if I pass in a threshold of 2, the result is 3 – CaffGeek Oct 01 '10 at 21:00
1

You also asked this as a SQL CLR UDF question so I'll answer in that specific context: you best optmiziation won't come from optimizing the Levenshtein distance, but from reducing the number of pairs you compare. Yes, a faster Levenshtein algorithm will improve things, but not nearly as much as reducing the number of comparisons from N square (with N in the millions of rows) to N*some factor. My proposal is to compare only elements who have the length difference within a tolerable delta. On your big table, you add a persisted computed column on LEN(Data) and then create an index on it with include Data:

ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED;
CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data);

Now you can restrict the sheer problem space by joining within an max difference on lenght (eg. say 5), if your data's LEN(Data) varies significantly:

SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data)
FROM Table A
JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5
Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • I was able to significantly improve performance by joining my tables with a `soundex` then applying the levenshtein distance – CaffGeek Oct 04 '10 at 21:28
  • 1
    You should also try adding a persisted column as SOUNDEX and then add the index on (SOUNDEX) with include (Data). – Remus Rusanu Oct 04 '10 at 21:42
  • I agree that reducing the number of comparisons is better; however I have reduced my comparisons down from 1.2 exa comparasions to 127.8 giga comparisons. Now I need a better Levenstien. At this point I need to bring down the computation from 3.5 days to within 10 hours. –  Dec 16 '10 at 22:09
0

Finally got it...though it's not as beneficial as I had hoped

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;

            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold 
            ? int.MaxValue 
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }
CaffGeek
  • 21,856
  • 17
  • 100
  • 184
  • 1
    It's easy to see why this isn't very beneficial. You set the minDistance to be the threshold, then replace it only with smaller values, then test to see if minDistance remained the same or grew over calculation of the cost of a string. For the test to exit processing early, every index of d[i] must result in a cost greater than the threshold, and since this algorithm will never reduce a cost it has calculated, that's extremely pessimistic. – KeithS Oct 04 '10 at 15:09
  • From what I can tell, each inner/`j` loop, must complete or the results become incorrect. The last item in the row is the maximum distance the transformation can take. The lowest value, is the minimum distance currently possible. Which is why I keep track of the lowest per row, and if it's already above the threshold, return. It should prevent several outer/`i` loops in strings that are wildly different – CaffGeek Oct 04 '10 at 16:15
  • This performs bad than my current implementation which computes the exact value, even with very low threshold. – AFract Dec 26 '14 at 18:23
  • 1
    I suggest to use stackoverflow.com/a/9454016/461444 instead. It's extremely fast and seems to give accurate results. – AFract Dec 26 '14 at 18:46