1

I'm designing a cool spell checker (I know I know, modern browsers already have this), anyway, I am wondering what kind of effort would it take to develop a fairly simple but decent suggest-word algorithm.

My idea is that I would first look through the misspelled word's characters and count the amount of characters it matches in each word in the dictionary (sounds resources intensive), and then pick the top 5 matches (so if the misspelled word matches the most characters with 7 words from the dictionary, it will randomly display 5 of those words as suggested spelling).

Obviously to get more advanced, we would look at "common words" and have a dictionary file that is numbered with 'frequency of that word used in English language' ranking. I think that's taking it a bit overboard maybe.

What do you think? Anyone have ideas for this?

Dexter
  • 6,170
  • 18
  • 74
  • 101
  • 1
    This has been discussed in SO before. See [here](http://stackoverflow.com/questions/41424/how-do-you-implement-a-did-you-mean) and [here](http://stackoverflow.com/questions/473522/word-comparison-algorithm), for example. Also, [another](http://www.norvig.com/spell-correct.html) link, from Peter Norvig, on how to create a simple spell checker. – abeln Apr 25 '11 at 19:48

2 Answers2

2

First of all you will have to consider the complexity in finding the "nearer" words to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and cooler solution here is to go for a TRIE datastructure. The complexity of finding these so called nearer words will take linear order timing and it is very easy to exhaust the tree.

A small example

Take the word "njce". This is a level 1 example where one word is clearly misspelled. The obvious suggestion expected would be nice. The first step is very obvious to see whether this word is present in the dictionary. Using the search function of a TRIE, this could be done O(1) time, similar to a dictionary. The cooler part is finding the suggestions. You would obviously have to exhaust all the words that start with 'a' to 'z' that has words like ajce bjce cjce upto zjce. Now to find the occurences of this type is again linear depending on the character count. You should not carried away by multiplying this number with 26 the length of words. Since TRIE immediately diminishes as the length grows. Coming back to the problem. Once that search is done for which no result was found, you go the next character. Now you would be searching for nace nbce ncce upto nzce. In fact you wont have explore all the combinations as the TRIE data structure by itself will not be having the intermediate characters. Perhaps it will have na ni ne no nu characters and the search space becomes insanely simple. So are the further occurrences. You could develop on this concept further based on second and third order matches. Hope this helped.

bragboy
  • 34,892
  • 30
  • 114
  • 171
  • Could you elaborate on that? How would you use a _trie_ to generate suggestions? – abeln Apr 25 '11 at 19:44
  • Yeah it does seem more complex. The simple algorithm I mentioned in my original post, brings up the most nonsense matches hehehe (even though it matches them). – Dexter Apr 25 '11 at 19:49
  • @Dexter: Do look at the links I posted on the comment to the question. This is a well studied, not hard to implement problem. – abeln Apr 25 '11 at 19:51
  • ofcourse this is not a state of the art solution, but very much decent and you can definitely explore a lot with this particular DS. – bragboy Apr 25 '11 at 19:56
  • @bragboy, @abeln the python example is nice but my program is C#. I don't know how to use lucene. I don't know how to use TRIE either (and it's in java). I'd rather not have to try to implement a search tree just for this simple simple problem. I'm looking for more practical methods that will not take long to implement in C#. The Levenshtein distance algorithm looks interesting, but I'm not sure if it will work well if I take the time to convert it to C#. – Dexter Apr 25 '11 at 20:01
  • @Bragboy: so if I understood correctly you iterate over every word in the trie; but that's not really more efficient that going through every word in the dictionary, even if they were stored, say, in an array, is it? – abeln Apr 25 '11 at 20:01
  • @Dexter: the Levenshtein gives you a measure of how "similar" two strings are. That's crucial to get meaningful suggestions, which is the crux of the problem. – abeln Apr 25 '11 at 20:03
  • This is true. I'm testing out my version of the Levenshtein Distance computation myself. I hope this works well! We'll see ! Thanks for the suggestion. – Dexter Apr 25 '11 at 20:04
  • @abeln: good question. perhaps this is a very common misunderstanding to think that you have iterate over every word, that a lot have towards TRIE implementation. The practical diminishing search that TRIE brings to the table is incomparable. Seriously, you can check this out yourself! – bragboy Apr 25 '11 at 20:05
  • check this out : http://blog.afterthedeadline.com/2010/01/29/how-i-trie-to-make-spelling-suggestions/ – bragboy Apr 25 '11 at 20:08
  • Ok, so the Levenshtein distance algorithm is fairly close. I tried 'thiz is a spel chek progam'. It was able to suggest top 5 correct spellings for thiz, chek, and progam. However, it couldn't figure out 'spel' into 'spell'. Still, I think this suits my needs as a very simplistic computation of suggested words. What other ways can I improve this? – Dexter Apr 25 '11 at 20:08
  • @Dexter: against what are you comparing the misspelled word, the whole dictionary? What suggestion are you getting for "spel"? – abeln Apr 25 '11 at 20:10
  • speel, sel, seel, spec, sped lots of words, that are apparently in my dictionary file, and they apparently are real words ( I checked ). I'm sure spell does come up, but not in the top 5. – Dexter Apr 25 '11 at 20:33
0

I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.

Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Jim Bolla
  • 8,265
  • 36
  • 54
  • I'm using C#, and there isn't any spell checking algorithm that is easily implementable. – Dexter Apr 25 '11 at 19:48
  • There's a .NET implementation of Lucene called, unsurprisingly, Lucene.NET. – Jim Bolla Apr 25 '11 at 19:54
  • Hmm I'll have to look into that. For now though I think the Levenshtein Distance algorithm works great and is simple for my purposes. – Dexter Apr 25 '11 at 20:32