3

I need a spell checker in Python. I've looked at previous answers and they all seem to be outdated now or not applicable:

Python spell checker using a trie: This question is more about the data structure.

Python Spell Checker: This is a spelling corrector, given two strings.

http://norvig.com/spell-correct.html: Often referenced and quite interesting, but also a spelling corrector, and its accuracy isn't quite good enough, though I'll probably use it in combination with a checker.

Spell Checker for Python: Uses pyenchant, which isn't maintained anymore.

Python: check whether a word is spelled correctly: Also suggests pyenchant, which isn't maintained.

Some details of what I need:

  • A function that accepts a string (a word) and returns a boolean indicating whether the word is valid English or not. The unit test would want True on an input of "car" and False on an input of "ijjk".
  • Accuracy needs to be above 90%, but it doesn't need to be much higher than that. I'm just using this to exclude words during preprocessing for document classification. Most of the errors will be picked up anyway as words that appear too seldom (though not all). Spelling correction won't work in all cases because a lot of the errors are OCR issues that are too far off to fix.
  • If it can deal with legal terms, that would be a big plus. Otherwise I might need to manually add certain terms to the dictionary.

What's the best approach here? Are there any maintained libraries? Do I need to download a dictionary and check against it?

Neil

3 Answers

3

Two recent Python libraries, both based on Levenshtein minimum edit distance and optimized for the task:

  • symspellpy
  • spello

Note that symspellpy is the Python port of the original SymSpell C# implementation; its description is here. The original SymSpell GitHub repository includes a dictionary with word frequencies.

Spello includes a basic pre-trained model on 30K news and 30K Wikipedia articles. But it's better to train it on your custom corpus from your domain.
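To make "Levenshtein minimum edit distance" concrete, here is a minimal pure-Python sketch of the distance itself. It only illustrates the metric; it is not how symspellpy or Spello compute it internally, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("scienc", "science"))  # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A checker built on this idea would treat a word as a likely misspelling when it is not in the dictionary but lies within a small distance (say 1 or 2) of a word that is.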

denis_smyslov
1

If you need a simple per-word check, all you need is a corpus of words (preferably matching your terminology): read it into a Python set and do a membership check for each word, one by one.

If and when you run into issues with this naive implementation, you can drill down into the concrete problems.
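The set-based approach could look roughly like the sketch below. The helper names `build_vocabulary` and `is_valid_word`, the file name `words.txt`, and the tiny sample word list are assumptions made for illustration; in practice the vocabulary would come from a real word list with one word per line, plus any domain terms you add.

```python
from typing import Iterable, Set


def build_vocabulary(lines: Iterable[str]) -> Set[str]:
    """Collect one word per line into a set, stripped and lowercased."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_valid_word(word: str, vocabulary: Set[str]) -> bool:
    """Return True if the word, ignoring case, is in the vocabulary."""
    return word.strip().lower() in vocabulary


# Tiny stand-in for a real word list. With a file, you would instead do:
#     with open("words.txt", encoding="utf-8") as f:  # hypothetical path
#         vocabulary = build_vocabulary(f)
sample_lines = ["car\n", "court\n", "estoppel\n", "Tort\n"]
vocabulary = build_vocabulary(sample_lines)

print(is_valid_word("car", vocabulary))   # True
print(is_valid_word("ijjk", vocabulary))  # False
```

Set membership is a constant-time lookup on average, so checking every token in a large document collection stays cheap. Legal terms can be handled by appending them to the word list before building the set.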

Slam
  • For Googlers: check this answer: https://stackoverflow.com/questions/28339622/is-there-a-corpora-of-english-words-in-nltk – Neil Oct 23 '18 at 12:22
1

You can use a dedicated spellchecking library in Python called enchant (installed as the pyenchant package).

To check whether a word is spelled correctly, i.e., whether such a word exists in English, all you have to do is this:

import enchant

d = enchant.Dict("en_US")  # load the US English dictionary
print(d.check("scienc"))   # check() returns True only if the word is known

This will give an output:

False

The best part about this library is it suggests the right spelling of the words. For example:

d.suggest("scienc")

will give an output:

['science', 'scenic', 'sci enc', 'sci-enc', 'scientist']

There are more features in this library. For example, the sample code above uses the US English dictionary ("en_US"). You can use other English variants such as "en_AU" for Australian English, or "en_CA" and "en_GB" for Canadian and British English respectively, to name a few. Non-English languages are supported as well, such as "fr_FR" for French!

For advanced usage, this library can check words against a custom list of words (this feature comes in handy when you have a set of proper nouns). The list is simply a file containing the words to be considered, one word per line. The following example creates a Dict object for the personal word list stored in "my_custom_words.txt":

custom_d = enchant.request_pwl_dict("my_custom_words.txt")

To check out more features and other aspects of the library, refer to: http://pyenchant.github.io/pyenchant/

Sushanth