20

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I'm looking to do is check a user's input against a certain column in a table for all these records and return any matches within a certain Hamming distance (again, this question isn't about determining Hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches for a name, and if no direct matches are found in the database, a list of possible matches is returned.

I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?

The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.

For what it's worth, I'm using NHibernate for data access.

I would appreciate any feedback on how I can do this or what my options are.

Tyler Treat
  • The easiest way (given that you've already implemented it) would be to do what Conrad suggested, and load the database once when the program loads. As for hitting the database more than once, would really need to see the logic behind your suggestions to work out the best way to query your database. – Rob Jan 28 '11 at 23:34
  • If you load all the DB records into memory, do you have to manage synchronising the data when the DB changes, or is it static data? – Drew Noakes Jan 29 '11 at 15:55
  • @Drew: when I say "loads all records from a user-specified table (or tables) into memory," I mean it loads them when the user makes a search and no results are found. Regardless, the data is fairly static. – Tyler Treat Jan 29 '11 at 19:38
  • In that case I'd suggest populating a search structure in memory and keeping it around for multiple queries. That's the only way I know of to do what you're talking about in tens of milliseconds. – Drew Noakes Jan 29 '11 at 20:03

6 Answers

7

Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx. 300 times faster on a 20,000-term data set) is to walk a trie. The performance difference is mostly down to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and getting better locality of reference and therefore fewer CPU cache misses. With this approach I am able to do lookups in around 2 ms on my web server. An added bonus is the ability to easily return all results that start with the provided string.

The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
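
For illustration, here is a minimal Python sketch of the trie walk described above (the names are made up, and this is not the actual C# implementation being referred to): each step down the trie computes one row of the Levenshtein matrix, and the walk is pruned as soon as every cell in the row exceeds the maximum distance.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set to the full word at terminal nodes

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root

def search(root, query, max_distance):
    """Return (word, distance) pairs within max_distance of query."""
    # First row of the Levenshtein matrix: distance from "" to each prefix of query.
    first_row = list(range(len(query) + 1))
    results = []
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_distance)
    return results

def _walk(node, ch, query, previous_row, results, max_distance):
    # Compute one row of the edit-distance matrix for this trie node.
    current_row = [previous_row[0] + 1]
    for col in range(1, len(query) + 1):
        insert_cost = current_row[col - 1] + 1
        delete_cost = previous_row[col] + 1
        replace_cost = previous_row[col - 1] + (0 if query[col - 1] == ch else 1)
        current_row.append(min(insert_cost, delete_cost, replace_cost))
    if node.word is not None and current_row[-1] <= max_distance:
        results.append((node.word, current_row[-1]))
    # Prune: only descend if some cell is still within the distance budget.
    if min(current_row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, current_row, results, max_distance)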

Drew Noakes
  • See http://norvig.com/ngrams/ for an implementation (using a hashtable instead of a trie, actually, but the idea is the same). – Darius Bacon Jan 30 '11 at 06:52
  • @Darius, that page features code that's pretty much identical to the other Norvig article I was talking about; however, it uses a hash table lookup. Like I said, the pseudocode is similar, but the implementation of a trie-based approach is quite different. The trie isn't just used to test for the existence of generated terms. Instead it is walked in a fashion that obviates the need to generate all terms with an edit distance of two. – Drew Noakes Jan 30 '11 at 16:00
  • It sounds like you only skimmed that code and got the wrong impression. Norvig: "If we considered all edits, a word like “acommodations” would yield 233,166 candidates. But only 11 of these are in the vocabulary. So edits works by pre-computing the set of all prefixes of all the words in the vocabulary. It then calls editsR recursively, splitting the word into a head and a tail (hd and tl in the code) and assuring that the head is always in the list of prefixes." It was derived from code I sent him, derived in turn from my http://wry.me/~darius/hacks/spelling.tar.gz which had used a trie. – Darius Bacon Jan 31 '11 at 08:52
  • When I ported to Python, it was *more* efficient to represent the prefixes in a dict instead of a trie. In C, it might well go the other way, as you say; but the difference is a smallish constant factor. – Darius Bacon Jan 31 '11 at 08:56
  • Morning @Darius. I think perhaps I wasn't detailed enough. My original implementation was a naive port of the code in Norvig's article to C# (`edits1`), generating alternates and then looking them up. The benefit of using a trie is that you can combine the two tasks and end up with a depth-first search of the trie, simulating substitutions/additions/etc as you walk, and short circuit the search when your max edit distance is met. The difference is more than just a distinction between lookup performances. – Drew Noakes Jan 31 '11 at 10:20
  • Argh! :) Look at it this way: when you walk a trie, you're visiting the set of prefixes of dictionary words. You prune that walk according to edit distances as you go. This is the algorithm we're talking about, yes? And the key speedup comes because most edits never even get considered, because they don't start with one of those prefixes. Still agreed? You seem to contrast this with some program that generates the distance-k edits of the source word, and checks if each edit is in the dictionary, like Norvig's earlier article, but, somehow, faster. Neither of the programs I linked to does this. – Darius Bacon Jan 31 '11 at 19:54
  • @Darius, yes that's the algorithm I'm talking about. I drew that contrast because that's the path I followed after reading Norvig's first article. I'll re-read the links you posted when I have a chance, thanks. It seems I missed something that wasn't explained directly in this thread. If you can elaborate on a better solution than the one I've been championing here, I'd love to hear about it (I understand it's not straightforward to define _better_.) – Drew Noakes Jan 31 '11 at 21:56
  • As far as I can tell, it's the same idea -- I was just pointing to code fleshing it out. I was the first to upvote this answer -- wasn't trying to criticize you. (I came to it a bit differently: I wrote spelling.tar.gz before Norvig's article and I didn't try to handle transpositions or word probability at the time.) – Darius Bacon Jan 31 '11 at 23:34
3

As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.

Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.
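
For illustration, a minimal Python sketch of a BK-tree (the class and function names are made up rather than taken from any of the linked implementations, and it assumes a non-empty word list):

def levenshtein(a, b):
    # Plain dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(curr[j - 1] + 1,             # insertion
                            prev[j] + 1,                 # deletion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})        # node = (word, {distance: child node})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d in node[1]:
                node = node[1][d]         # follow the existing edge
            else:
                node[1][d] = (word, {})   # attach a new child at distance d
                return

    def query(self, word, max_distance):
        results, stack = [], [self.root]
        while stack:
            candidate, children = stack.pop()
            d = levenshtein(word, candidate)
            if d <= max_distance:
                results.append((candidate, d))
            # Triangle inequality: only children whose edge distance lies
            # within max_distance of d can contain matches.
            stack.extend(child for dist, child in children.items()
                         if d - max_distance <= dist <= d + max_distance)
        return results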

Mr. Putty
2

I guess the Levenshtein distance is more useful here than the Hamming distance.

Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:

  • 1 insertion (208)
    • aexample
    • bexample
    • cexample
    • ...
    • examplex
    • exampley
    • examplez
  • 1 deletion (7)
    • xample
    • eample
    • exmple
    • ...
    • exampl
  • 1 substitution (182)
    • axample
    • bxample
    • cxample
    • ...
    • examplz

You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.

Notice how most misspellings are the same operation at the same position, just with a different character, so each group can be collapsed into a single wildcard pattern:

  • 1 insertion (8)
    • ?example
    • e?xample
    • ex?ample
    • exa?mple
    • exam?ple
    • examp?le
    • exampl?e
    • example?
  • 1 deletion (7)
    • xample
    • eample
    • exmple
    • exaple
    • examle
    • exampe
    • exampl
  • 1 substitution (7)
    • ?xample
    • e?ample
    • ex?mple
    • exa?ple
    • exam?le
    • examp?e
    • exampl?

That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.

Example: User enters exaple (notice missing m).

SELECT DISTINCT word
           FROM dictionary
          WHERE hint = '?exaple'
             OR hint = 'e?xaple'
             OR hint = 'ex?aple'
             OR hint = 'exa?ple'
             OR hint = 'exap?le'
             OR hint = 'exapl?e'
             OR hint = 'exaple?'
             OR hint = 'xaple'
             OR hint = 'eaple'
             OR hint = 'exple'
             OR hint = 'exale'
             OR hint = 'exape'
             OR hint = 'exapl'
             OR hint = '?xaple'
             OR hint = 'e?aple'
             OR hint = 'ex?ple'
             OR hint = 'exa?le'
             OR hint = 'exap?e'
             OR hint = 'exapl?'

exaple with 1 insertion == exa?ple == example with 1 substitution
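
For illustration, a rough Python sketch of the hint generation (the function name is made up). The same routine would be used both to populate the hint column for every dictionary word and to build the query terms from the user's input:

def hints(word):
    # 1 insertion: a '?' can sit in any of len(word) + 1 positions.
    for i in range(len(word) + 1):
        yield word[:i] + '?' + word[i:]
    # 1 deletion: drop each character in turn.
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]
    # 1 substitution: replace each character with '?'.
    for i in range(len(word)):
        yield word[:i] + '?' + word[i + 1:]

For example, list(hints('exaple')) yields exactly the 19 patterns in the query above.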

See also: How does the Google “Did you mean?” Algorithm work?

dtb
  • +1 agreed on the Levenshtein distance - unfortunately the Peter Norvig article does not address the memory issue - it's a tradeoff of memory vs. processing speed - depending on the size, loading the full dictionary into memory might not even be possible, but saving all misspellings in the DB does not sound right either. – BrokenGlass Jan 29 '11 at 00:05
  • @BrokenGlass, using a trie can reduce the memory footprint for large data sets. It also improves lookup performance enormously (200-300 times faster in my case.) – Drew Noakes Jan 30 '11 at 16:17
  • Thanks Drew - I'll have to look into that - upvoted your answer as well – BrokenGlass Jan 30 '11 at 18:04
1

To add to the answer you marked as correct:

Note: when I say dictionary in this post, I mean a hash map / map; basically, a Python dictionary.

Another way you can improve its performance is by creating an inverted index of words.

So rather than calculating the edit distance against the whole DB, you create 26 dictionaries, one for each letter of the alphabet. English has 26 letters, so the keys are "a", "b", ..., "z".

So assume you have the word "apple" in your DB.

So in the "a" dictionary : you add the word "apple"

in the "p" dictionary: you add the word "apple"

in the "l" dictionary: you add the word "apple"

in the "e" dictionary : you add the word "apple"

So, do this for all the words in your database.

Now when a misspelled word is entered, let's say aplse:

you start with "a" and retreive all the words in "a"

then you start with "p" and find the intersection of words between "a" and "p"

then you start with "l" and find the intersection of words between "a", "p" and "l"

and you do this for all the letters.

in the end you will have just the set of words which are made up of the letters "a", "p", "l", "s", "e"

In the next step, you calculate the edit distance between the input word and the set of words returned by the above steps, thus drastically reducing your runtime.

Now there might be a case when nothing is returned.

For something like "aklse", there is a good chance that no word is made up of just these letters. In this case, you will have to start relaxing the above steps until you are left with a manageable number of words.

So, something like: start with *klse (the intersection of the word sets for k, l, s, e), num(words returned) = k1

then a*lse (the intersection of the word sets for a, l, s, e), num(words) = k2

and so on. Choose the one which returns the higher number of words. In this case, there is really no single answer, as a lot of words might have the same edit distance; you can just say that if the edit distance is greater than some "k" then there is no good match.

There are many sophisticated algorithms built on top of this, like using statistical inference after these steps (the probability that the word is "apple" when the input is "aplse", and so on). Then you go the machine learning way :)
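
For illustration, a rough Python sketch of the inverted index and intersection described above (the names are made up, and it assumes an edit-distance helper such as the one sketched earlier in the thread):

from collections import defaultdict

def build_index(words):
    # Map each letter to the set of words containing it.
    index = defaultdict(set)
    for word in words:
        for letter in set(word):
            index[letter].add(word)
    return index

def candidates(index, misspelled):
    # Intersect the word sets for every distinct letter of the input.
    sets = [index.get(letter, set()) for letter in set(misspelled)]
    return set.intersection(*sets) if sets else set()

# Usage: rank the (much smaller) candidate set by edit distance.
# index = build_index(words_from_db)
# best = sorted(candidates(index, "aplse"),
#               key=lambda w: levenshtein("aplse", w))[:10]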

frazman
1

it loads all records from a user-specified table (or tables) into memory and then performs the check

don't do that

Either

  • Do the match on the back end and only return the results you need.

or

  • Cache the records into memory early on, take the working-set hit, and do the check when you need it.
Conrad Frix
1

You will need to structure your data differently than a database can. Build a custom search tree, with all the dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast: O(n log n), if I recall correctly.

Have a look at BK-Trees

Also, instead of using the Hamming distance, consider the Levenshtein distance

Darcara