Optimizing Levenshtein distance algorithm

Question

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

Where should I go from here?

store the calculated value in each row, and update if target column chnages.... — Mitch Wheat, May 27 '10 at 06:05
No I haven't profiled it... I'll have to look up how to do that, this is the first time I've tried optimizing a stored proc before. I cannot store the calculated value, this is being used for a search and inputs on the search will rarely be repeated. — Matt, May 27 '10 at 06:24

Mike Dunlavey · Answer 1 · 2010-05-27T17:02:28.643

6

The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling correcter) as a trie.

Then I used a branch-and-bound routine to look up nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.

Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.

First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, then you walk it with a budget of one. That will find matches at a distance of 1. If you don't find any, then you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the time is dominated by the last walk that you make.

Added: outline of code (pardon my C):

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.

You probably want to add a character array argument to represent the current word you are finding in the trie.

edited May 27 '10 at 17:02

answered May 27 '10 at 11:36

Mike Dunlavey

40,059
14
91
135

An outline would be really helpful. I understand walking with error budgets, but I don't really know how to do the depth-first tree walk... – Matt May 27 '10 at 15:58
@Matt: a depth first tree walk? You can just use a recursive dfs function, or use a stack. Look up dfs. – Cam May 27 '10 at 17:06
Great! I've been working through the code trying to translate it to SQL and it's working alright so far. I'm not quite sure how to turn the entire table into a Trie though and how to traverse it... It's not quite like C where we have pointers or anything. Anyone have any ideas? I'll probably post this as another question. Thanks again for the help! – Matt May 28 '10 at 05:04
@Matt: I don't think you can actually do this in SQL. I think you gotta grab all the data and build a trie in some other language. – Mike Dunlavey May 28 '10 at 11:10
Please check this question: http://stackoverflow.com/questions/2926790/how-can-i-store-data-in-a-table-as-a-trie-sql-server for details on how to do this in SQL. – Matt May 29 '10 at 06:26
Struggling a bit with a few parts of your code. Think you could rename some of the variables (p and s) or maybe add a bit more commenting in each section? It would really help out with finishing this up. – Matt May 29 '10 at 06:54
@Matt: `p` is just a pointer to the node, and it contains an array `p[]` indexed by a character, pointing to sub-nodes. `s` is just a pointer to character containing a null-terminated string. – Mike Dunlavey May 29 '10 at 14:48
@Mike: Yes but what is `s` for? You mention that I'll probably want to add a character array argument to represent what I'm searching for in the trie, but I thought that's what `s` was for. And why do you use 128 for the length your `p[]` array, would that be because of 128 ascii characters? – Matt Jun 09 '10 at 02:49
@Matt: `s` is the original array of characters of the string you are searching for. As the program is right now, there is nothing you can print to represent the word you have found. I assumed you could add that. And yes I assumed 128 ascii characters. That's how tries work. – Mike Dunlavey Jun 09 '10 at 12:51
This was very thought provoking for my own implementation, thanks. I've managed to squeeze a lot more performance out of my fuzzy-lookup code now, so thank you very much. One point that's not immediately apparent with this approach is that the same key can be found multiple times, for example if one step adds `c` then the next step removes it, then two of your budget is spent and, if a 0-distance match exists, you will have it returned along with a 2-distance match of the same node. Theoretically this algorithm may be faster if extra pruning can be done in a computationally efficient way. – Drew Noakes Dec 21 '10 at 05:38
I added two args to `walk` to track `inserted` and `removed`, so if an insert is made, then the remove logic won't try to recurse with that char. Similarly, if a char is removed, it won't try to re-add it. I will run some profiling to measure perf differences. However even with these two checks, dupes are being returned. Of course I may have misinterpreted your approach! – Drew Noakes Dec 21 '10 at 05:41
@Matt, I can't imagine possibly implementing this in SQL! You may be able to do something with a database that allows embedding code such as MsSQL/CLR, but the load time of the supporting data structure is several orders of magnitude higher than the query time, so you don't want to build this structure for every request if you can avoid it. – Drew Noakes Dec 21 '10 at 05:43
Why is the end condition p==NULL? Shouldn't there be some check that a word ends on the node? – Pär Bohrarper Dec 23 '10 at 14:30
@Pär: That's what the `*s == 0` above it does? – Mike Dunlavey Dec 23 '10 at 20:27
But s is the search term, right? Say your trie has "cat" and "cataract" in it. How do you know that "cat" is a word, but not "catar"? – Pär Bohrarper Dec 26 '10 at 14:26
@Pär: `s` is a `char*`. It is being advanced forward through the search term. When it reaches the trailing null character, it is at the end of the word. – Mike Dunlavey Dec 26 '10 at 19:55
Your method repeats work unnecessarily, in the case of exact match and replacement. I've found that you can save this work by combining the dynamic programming method with the trie. Each time you recurse, you fill in a single row of the table. If any entry in the row is less than the error budget, keep recursing and adding rows. When you return from the function, discard the row. This is like using the N^2 algorithm for each word, except that 1) we do them in alphabetical order so we don't repeat any work-rows, and 2) we discard entire branches due to the error budget. (rhymebrain.com does it) – Steve Hanov Jan 03 '11 at 15:02
@Steve: It's been a long time since I did this, so I may not have it *exactly* right. Originally I saw this as a simple alternative to combined breadth/depth-first search, because it's easy to elaborate with more complex misspelling rules. Thanks for your input. – Mike Dunlavey Jan 03 '11 at 15:42
@Mike: Yes, I know that means it's at the end of the *search term*, but how do you know that the position you are at in the trie when you reach the end of the search term is in fact a word that you inserted in the trie? Cf my example with "cat" and "cataract", I don't see how your code would find "cat" given "kat" or "catt" as search term. – Pär Bohrarper Jan 08 '11 at 18:17
@Pär: because `p` is null. It is walking the entire trie, and a terminus is when it gets to a null pointer. So if the pointer is null, and `*s` is the null character, then the trie entry (the stack of characters above the null pointer) can be output. No? – Mike Dunlavey Jan 08 '11 at 23:16

Optimizing Levenshtein distance algorithm

1 Answers1

Linked