Best Practice For Levenshtein Distance on SQL Server

Question

I have a web and a mobile dictionary application that uses SQL Server. I am trying to implement a simple version of a "did you mean" feature. If the phrase the user entered does not exist in the database, I need to make suggestions.

I am planning to use the Levenshtein distance algorithm, but there is one point I can't figure out: do I need to calculate the Levenshtein distance between the user's entry and every word in my database, one by one?

Let's assume that I have one million words in my database. When a user enters an incorrect word, will I have to calculate the distance a million times?

Obviously that would take a great deal of time. What is the best practice for this situation?

May be a little dated, but take a peek at https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql — John Cappelletti, Sep 02 '17 at 13:44
If you are doing this for anything other than a learning experience, I would seriously advise you to reconsider doing it in the database. A SQL database is very good at relational queries, but for something like this there are much better tools you can use. — Allan S. Hansen, Jan 04 '18 at 20:47
Consider the following thread; I think it will help you: https://cs.stackexchange.com/questions/69726/match-dictionary-to-misspelled-word-corner-cases — Derrick Bell, Jan 05 '18 at 23:55

score 1 · Answer 1 · answered Sep 02 '17 at 10:03

Have you already looked at SOUNDEX, the built-in function available in SQL Server?

You could use a trigger that calculates the SOUNDEX code of a column and stores it next to that column each time the column is updated. When searching, you calculate the SOUNDEX code of the search term and compare it with the stored SOUNDEX column in the table, so the per-word work is done once at write time rather than on every query.
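
As a rough illustration of the precompute-and-compare idea, here is a minimal Python sketch rather than T-SQL. The `soundex` function is a textbook American Soundex written for this example, so it may not match SQL Server's built-in SOUNDEX in every edge case, and `build_index` and `suggest` are hypothetical helper names. It only looks at ASCII letters, so accented or Turkish-specific characters are simply dropped.

```python
from collections import defaultdict

# Letter-to-digit table for classic American Soundex.
_CODES = {
    ch: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for ch in letters
}


def soundex(word):
    """Return a 4-character Soundex-style code (ASCII letters only)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def build_index(words):
    """Compute each word's code once and group the words by code."""
    index = defaultdict(list)
    for word in words:
        index[soundex(word)].append(word)
    return index


def suggest(index, term):
    """Return the stored words whose code equals the search term's code."""
    return list(index.get(soundex(term), []))
```

With Robert, Rupert and Rubin indexed, looking up "Robbert" returns Robert and Rupert, because those three spellings share the code R163 while Rubin maps to R150.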

Frederik Gheysels

What about other languages? In my application, users will enter Spanish or Turkish words. – Umut Derbentoğlu Sep 02 '17 at 10:54

score 0 · Answer 2 · answered Sep 02 '17 at 21:51

In terms of implementation, I'd set it up so that the word list is cached on the web server and the comparisons are done there. You don't want to execute a database stored procedure every time a user makes a keystroke. For performance reasons, you'll want to keep the back and forth as short and simple as possible. Besides, procedural languages are better at this type of calculation than declarative languages anyway. If possible, you could also create a small indexed cache on the client machine so that the final stages can be completed without making any web calls.

In terms of making the actual matches, look up Lawrence Philips' Double Metaphone algorithm. It's not as good as Google's "did you mean?" but it's much better than SOUNDEX, and it has been ported to many programming languages. By using Double Metaphone in conjunction with Levenshtein distance you should be able to make some good matches.
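
To make this concrete, here is a minimal Python sketch of comparing against a word list held in memory. It is an illustration under assumptions, not a tuned implementation: `levenshtein` and `suggest` are names chosen for this example, the `max_dist` cutoff is an added optimisation that skips words which cannot be close enough, and `key` is a hypothetical hook where a phonetic function such as a Double Metaphone implementation could be plugged in (assumed here to return a single value per word).

```python
def levenshtein(a, b, max_dist=None):
    """Edit distance between a and b.

    If max_dist is given, any result above it only means "too far";
    the exact value is not computed in that case.
    """
    if max_dist is not None and abs(len(a) - len(b)) > max_dist:
        return max_dist + 1  # the length gap alone already exceeds the bound
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        # The final distance can never be smaller than the smallest
        # value in any row, so it is safe to stop early here.
        if max_dist is not None and min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]


def suggest(term, words, max_dist=2, limit=5, key=None):
    """Return up to `limit` cached words within max_dist edits of term.

    key: optional phonetic-key function (hypothetical hook); when given,
    only words that share the term's key are compared at all.
    """
    target = key(term) if key else None
    scored = []
    for word in words:
        if key and key(word) != target:
            continue
        dist = levenshtein(term, word, max_dist)
        if dist <= max_dist:
            scored.append((dist, word))
    scored.sort()  # closest first, ties broken alphabetically
    return [word for _, word in scored[:limit]]
```

For a cached list containing house, mouse, horse, houses and banana, `suggest("hous", words)` ranks house first at distance 1, followed by the distance-2 words. In a real setup the phonetic keys would be computed once when the cache is built rather than on every lookup.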
