
I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.

I'd kind of like to write this myself, though I really don't know where to begin.

coppro
dicroce
  • Please do some of the work before throwing it over to Stack Overflow. Sketch out a design, identify the key blocks preventing you making progress, tell us about the context this will be used in - put in some effort. – The Archetypal Paul Dec 06 '08 at 20:54
  • If you are looking for answers like "read this: link" say so. you might get better responses. – BCS Dec 06 '08 at 20:57
  • So, Paul, this question isn't a good "stackoverflow" question? Exactly how? You know, a page on the internet called "How do spell checks work?" with thoughtful answers would be a useful thing to anyone just starting out learning how these things work. – dicroce Dec 06 '08 at 21:08
  • As worded, I don't think it is a good question, no. I think it needed more context on why you needed to write one, and what you had worked out beforehand. It comes over as asking Stackoverflow to do your work for you. Maybe it's just me. – The Archetypal Paul Dec 07 '08 at 09:46
  • For the core problem of efficiently determining whether a word is in the dictionary or not, one popular technique is to use a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter). – Steve Summit Jan 03 '22 at 22:20

7 Answers


Read up on Tree Traversal. The basic concept is as follows:

  1. Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
  2. Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate whether or not the current node is the end of a valid word (a minimal C sketch of this structure follows the list).
  3. Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
  4. If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and test each of them against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
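
As a concrete starting point, here is a minimal C sketch of steps 2 and 3, assuming lowercase ASCII words only (error handling and freeing of the tree are omitted for brevity):

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One node of the 26-way search tree described above.
       Assumes lowercase ASCII words only. */
    typedef struct node {
        struct node *child[26];
        int is_word;                     /* the "valid end of word" flag */
    } node;

    static node *node_new(void) {
        return calloc(1, sizeof(node));  /* children start out NULL */
    }

    /* Step 2: add one dictionary word to the tree. */
    static void tree_insert(node *root, const char *word) {
        node *n = root;
        for (; *word; word++) {
            int i = tolower((unsigned char)*word) - 'a';
            if (i < 0 || i > 25)
                continue;                /* ignore non-letters */
            if (!n->child[i])
                n->child[i] = node_new();
            n = n->child[i];
        }
        n->is_word = 1;
    }

    /* Step 3: return 1 if the word is in the dictionary, 0 otherwise. */
    static int tree_lookup(const node *root, const char *word) {
        const node *n = root;
        for (; *word; word++) {
            int i = tolower((unsigned char)*word) - 'a';
            if (i < 0 || i > 25 || !n->child[i])
                return 0;                /* no child for this letter */
            n = n->child[i];
        }
        return n->is_word;               /* must end on a flagged node */
    }

    int main(void) {
        const char *dict[] = { "apex", "apple", "appoint", "appointed" };
        node *root = node_new();
        for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
            tree_insert(root, dict[i]);
        printf("%d %d %d\n",
               tree_lookup(root, "apple"),   /* 1 */
               tree_lookup(root, "appint"),  /* 0 */
               tree_lookup(root, "ape"));    /* 0 */
        return 0;
    }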

A really short example:

Dictionary:

apex apple appoint appointed

Tree: (* indicates valid end of word). Update: thank you to Curt Sampson for pointing out that this data structure is called a Patricia Tree

A -> P -> E -> X*
      \-> P -> L -> E*
           \-> O -> I -> N -> T* -> E -> D*

Document:

apple appint ape

Results:

  • "apple" will be found in the tree, so it is considered correct.
  • "appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
  • "ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.

edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
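
To make the measure concrete, here is a small, unoptimized C sketch of the standard dynamic-programming version (it keeps only one row of the table and, for simplicity, assumes the second string is shorter than 256 characters):

    #include <stdio.h>
    #include <string.h>

    /* Classic dynamic-programming edit distance: the minimum number of
       single-character insertions, deletions, or substitutions needed
       to turn a into b. */
    static int levenshtein(const char *a, const char *b) {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int row[256];                    /* assumes strlen(b) < 256 */

        for (int j = 0; j <= lb; j++)
            row[j] = j;                  /* distance from "" to b[0..j) */

        for (int i = 1; i <= la; i++) {
            int diag = row[0];           /* value of row[i-1][j-1] */
            row[0] = i;                  /* distance from a[0..i) to "" */
            for (int j = 1; j <= lb; j++) {
                int up = row[j];         /* value of row[i-1][j] */
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int best = up + 1;                                 /* delete a[i-1] */
                if (row[j - 1] + 1 < best) best = row[j - 1] + 1;  /* insert b[j-1] */
                if (diag + cost < best) best = diag + cost;        /* substitute */
                row[j] = best;
                diag = up;
            }
        }
        return row[lb];
    }

    int main(void) {
        printf("%d\n", levenshtein("appint", "appoint"));  /* prints 1 */
        return 0;
    }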

Dave Jarvis
e.James
  • I added a link to the Wikipedia article on Levenshtein Distance. This is the best measurement algorithm I could find for ranking words according to how far away they are from a target word (the incorrectly spelled word, in this case) – e.James Dec 06 '08 at 22:30
  • a ternary tree is described in 2. It has a pretty big space tradeoff. – HeretoLearn Jul 03 '09 at 18:29
  • @HeretoLearn: I don't think that is correct. A ternary tree has a maximum of three children per node, but the structure I described has up to 26 children per node. – e.James Jul 03 '09 at 18:46
  • Would using a hash table work? You could convert the input string into an integer, and then look it up in the array of words? How would this compare to a tree? – Swedish Architect Feb 14 '14 at 10:32

Given you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell (LGPL licensed). If you really have to implement it yourself, please tell us why.

The Archetypal Paul
  • I'm curious. I may choose to use an existing package, but I'd like to understand this problem first. After googling a bit, I'm considering using a Levenshtein distance algorithm against a dictionary, but I'm not sure if that will be fast enough (speed and code size matter). – dicroce Dec 06 '08 at 21:04
  • Another open source spell checker is hunspell, it is used by OOo and Mozilla. – quinmars Dec 07 '08 at 13:29

One should look at prefixes and suffixes.

suddenly = sudden + ly.

By removing the "ly" you can get away with storing just the root word.

Likewise preallocate = pre + allocate.

And lovingly = love + ing + ly gets a bit more complex, as the English rules for "ing" get invoked.

There is also the possibility of using some sort of hashing function to map a root word into a specific bit in a large bit map, as a constant-time method of determining if the root word is spelled correctly.
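
A rough C sketch of that idea, using a single FNV-1a hash into one large bit array (the hash choice and sizes here are just illustrative). Because two different words can hash to the same bit, a set bit only means the word is probably correct; the Bloom filter mentioned in the comments above reduces that risk by combining several hashes:

    #include <stdint.h>
    #include <stdio.h>

    #define BITMAP_BITS (1u << 24)           /* 16 Mi bits = 2 MB of memory */

    static uint8_t bitmap[BITMAP_BITS / 8];

    /* FNV-1a hash of the word, reduced to a bit position in the map. */
    static uint32_t word_bit(const char *word) {
        uint32_t h = 2166136261u;
        for (; *word; word++) {
            h ^= (uint8_t)*word;
            h *= 16777619u;
        }
        return h % BITMAP_BITS;
    }

    /* Called once per root word while loading the dictionary. */
    static void bitmap_add(const char *word) {
        uint32_t b = word_bit(word);
        bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
    }

    /* Constant-time membership test; collisions can give false positives. */
    static int bitmap_check(const char *word) {
        uint32_t b = word_bit(word);
        return (bitmap[b / 8] >> (b % 8)) & 1;
    }

    int main(void) {
        bitmap_add("sudden");
        bitmap_add("allocate");
        printf("%d %d\n",
               bitmap_check("sudden"),   /* 1 */
               bitmap_check("suden"));   /* 0, barring a collision */
        return 0;
    }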

You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.

I would advise prototyping with a small set of words. Do a lot of testing, then scale up. It is a wonderful educational problem.

EvilTeach
  • I can't agree with this. Algorithmically stemming English words is notoriously inaccurate. It might have been worth the space and time saving back in the dark ages, but not any more. – Daniel Cassidy Dec 07 '08 at 02:36
  • "English doesn't borrow from other languages, it follows them into dark alleys, knocks them over, and goes through their pockets for loose grammar." – Ben Blank Sep 09 '09 at 18:23

I've done this in class.

You should consider the Python Natural Language Toolkit (NLTK), which is made specifically to handle this.

It also lets you create text interpreters such as chatbots.

Eric

The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage: Hunspell at SourceForge

Thomas Maierhofer

Splitting a word into root and suffix is known as stemming; the "Porter Stemming Algorithm" is the classic way to do it. It's a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".
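
For the flavor of the idea only (this is nowhere near the real Porter algorithm, which has careful rules about when a suffix may be stripped), a crude C sketch might look like this:

    #include <stdio.h>
    #include <string.h>

    /* Crude suffix stripping -- nothing like the full Porter algorithm,
       just enough to show the idea of reducing words to a shorter root. */
    static void strip_suffix(char *word) {
        static const char *suffixes[] = { "ing", "ly", "ed", "s" };
        size_t len = strlen(word);
        for (size_t i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
            size_t slen = strlen(suffixes[i]);
            if (len > slen + 2 && strcmp(word + len - slen, suffixes[i]) == 0) {
                word[len - slen] = '\0';     /* e.g. "checking" -> "check" */
                return;
            }
        }
    }

    int main(void) {
        char w1[] = "checking", w2[] = "suddenly";
        strip_suffix(w1);
        strip_suffix(w2);
        printf("%s %s\n", w1, w2);           /* prints "check sudden" */
        return 0;
    }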

Martin Beckett

e.James gives a great answer on how to tell whether a word is valid. How likely misspellings are determined probably depends on the spell checker.

One such method, and the one that I would use, is the Levenshtein distance, which looks at how many letters must be added, removed, or swapped in a word in order to make another word.

Say you spelled "country" as "contry". The Levenshtein distance would be 1, since you only have to add one letter to transform "contry" into "country".

You could then loop through all possible correct spellings of words (there are only about 171,000 English words, and 3,000 of those account for 95% of text), determine those with the lowest Levenshtein distance, and then return the top X words that are most similar to the misspelled word.

There's a great Python package called FuzzyWuzzy which implements this efficiently and generates a % similarity between two words or sentences based on this measure.

TheSaint321