187

I want to check in a Python program if a word is in the English dictionary.

I believe the NLTK WordNet interface might be the way to go, but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?

Salvador Dali
  • 214,103
  • 147
  • 703
  • 753
Barthelemy
  • 8,277
  • 6
  • 33
  • 36
  • You can see this page: https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language. I recommend `langid`. – Mahdi Ebi Oct 12 '21 at 19:42

12 Answers

280

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
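
To see which dictionaries your installation actually provides, pyenchant exposes a couple of helpers (a quick sketch; the available tags depend on the Enchant backend installed on your system):

>>> import enchant
>>> enchant.list_languages()   # e.g. ['de_DE', 'en_GB', 'en_US', 'fr_FR']
>>> enchant.dict_exists("en_GB")
True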

There appears to be a pluralisation library called inflect, but I've no idea whether it's any good.
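
If you do try it, a sketch of the plural-to-singular check from the question might look like this (an illustration only; it assumes inflect's singular_noun, which returns False when the word is not a recognised plural):

import enchant
import inflect

d = enchant.Dict("en_US")
p = inflect.engine()

def is_dictionary_word_or_plural(word):
    # check the word itself, then fall back to its singular form
    if d.check(word):
        return True
    singular = p.singular_noun(word)
    return bool(singular and d.check(singular))

print(p.singular_noun("properties"))               # 'property'
print(is_dictionary_word_or_plural("properties"))  # True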

Kaushik Acharya
  • 1,520
  • 2
  • 16
  • 25
Katriel
  • 120,462
  • 19
  • 136
  • 170
  • 2
    Thank you, I did not know about PyEnchant and it is indeed much more useful for the kind of checks I want to make. – Barthelemy Sep 24 '10 at 16:52
  • It doesn't recognize ? Not a common word, but I know as an abbreviation for , and I do not know . Just wanted to point out that the solution isn't one-size-fits-all and that a different project might require different dictionaries or a different approach altogether. – dmh Apr 22 '12 at 18:02
  • Well, if you want a different dictionary you can always plug one in the back of PyEnchant! Note BTW that even the OED only lists "helo" as obsolete... – Katriel Apr 23 '12 at 19:19
  • How can one use Openoffice languages? – Palash Kumar Apr 14 '14 at 09:12
  • enchant doesn't recognize words like american, chinese, indian, and countries' names – Alok Nayak Nov 28 '15 at 11:15
  • This is not exactly what the OP has asked for, though. Enchant is a spellchecker. There are many strings a spellchecker must accept but a dictionary should not include, such as "et" (because it appears in "et al"), "situ" (because it appears in "in situ"), "hominem", etc. Also note that if it is precisely dictionary forms that you are interested in, forms such as "gives" should return False too. – Roozbehan May 23 '16 at 05:36
  • 37
    Package is basically impossible to install for me. Super frustrating. – Monica Heddneck May 25 '17 at 00:43
  • 13
    Enchant is not supported at this time for python 64bit on windows :( https://github.com/rfk/pyenchant/issues/42 – Ricky Boyce Jul 05 '17 at 00:23
  • The "tutorial" link is broken. – Rémy Jan 15 '19 at 09:16
  • 13
    [pyenchant](https://github.com/rfk/pyenchant) is no longer maintained. [pyhunspell](https://github.com/blatinier/pyhunspell) has more recent activity. Also `/usr/share/dict/` and `/var/lib/dict` may be referenced on *nix setups. – pkfm Mar 24 '19 at 01:38
  • 2
    [pyenchant](https://github.com/pyenchant/pyenchant) has apparently picked up a maintainer (August 2021). – Anaksunaman Aug 09 '21 at 18:56
  • Pyenchant doesn't have many normal words – Arnav Mehta Oct 05 '22 at 12:51
  • @pkfm still, I can't seem to install it. – Ac Hybl Nov 04 '22 at 01:17
  • Installation worked for me easily. – ABCD May 11 '23 at 21:19
75

It won't work well with WordNet, because WordNet does not contain all English words. Another NLTK-based possibility that doesn't need enchant is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True
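
As the comments point out, words.words() is a plain list, so repeated membership checks are slow; building a set once makes each lookup effectively constant-time (a small sketch):

from nltk.corpus import words   # requires nltk.download('words') once

english_vocab = set(w.lower() for w in words.words())

def is_english_word(word):
    return word.lower() in english_vocab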
Sadık
  • 4,249
  • 7
  • 53
  • 89
  • 12
    The same remark applies here too: a lot faster when converted to a set: `set(words.words())` – Iulius Curt Sep 30 '14 at 19:41
  • 1
    watch out as you need to singularise words to get proper results – famargar Sep 06 '18 at 10:44
  • 3
    caution : words like pasta or burger are not found in this list – Paroksh Saxena Jan 10 '19 at 11:37
  • 1
    Actually, no library can cover all English words. Moreover, `words.words()` from `nltk` includes proper nouns (e.g., Abraham) as English words, but they can occur in any language, especially if the foreign text is transliterated to English. – hafiz031 Feb 17 '21 at 05:29
  • 1
    to be able to use `words` you first need to install it - `import nltk` and then `nltk.download('words')` – SubMachine Jan 21 '22 at 10:59
  • @SubMachine, additionally, it only worked for me if I did this in the python console. – Ac Hybl Nov 04 '22 at 01:33
53

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # not an English word
else:
    pass  # English word

You should refer to this article if you have trouble installing wordnet or want to try other approaches.
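
WordNet can also help with the plural question from the original post: wordnet.morphy() tries to map a word to a base form that WordNet knows, returning None for unknown words (a small sketch):

from nltk.corpus import wordnet

print(wordnet.morphy("properties"))   # 'property'
print(wordnet.morphy("qwertyuiop"))   # None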

nickb
  • 59,313
  • 13
  • 108
  • 143
Susheel Javadi
  • 3,034
  • 3
  • 32
  • 34
42

Using a set to store the word list, because looking words up in a set is faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I'd just include the plurals in the word list to begin with.
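
If you do want to handle plurals yourself against the word list above, a crude version might look like this (purely illustrative, and it already shows why the rules are tricky: irregular plurals like "children" or "geese" fall through):

def is_english_word_or_plural(word):
    word = word.lower()
    if word in english_words:
        return True
    # crude singularisation: properties -> property, boxes -> box, hams -> ham
    if word.endswith("ies") and word[:-3] + "y" in english_words:
        return True
    if word.endswith("es") and word[:-2] in english_words:
        return True
    if word.endswith("s") and word[:-1] in english_words:
        return True
    return False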

As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English word lists if you specifically want one of those dialects.

kindall
  • 178,883
  • 35
  • 278
  • 309
  • 9
    If you make `english_words` a `set` instead of a `list`, then `is_english_word` will run a lot faster. – dan04 Sep 24 '10 at 16:14
  • I actually just redid it as a dict but you're right, a set is even better. Updated. – kindall Sep 24 '10 at 16:16
  • 1
    You can also ditch `.xreadlines()` and just iterate over `word_file`. – FogleBird Sep 24 '10 at 16:18
  • Thanks for your answer. The reason why I wanted to use wordnet was because I could not find any standard/obvious list of English words including plural. Where would I find such files (with plural included)? – Barthelemy Sep 24 '10 at 16:27
  • 3
    Under ubuntu the packages `wamerican` and `wbritish` provide American and British English word lists as `/usr/share/dict/*-english`. The package info gives http://wordlist.sourceforge.net as a reference. – intuited Sep 24 '10 at 16:45
  • +1 I've tried all the answers here and this one is by far the easiest, fastest, lightest and most reliable way described. Plus if you have a special vocabulary or some slang you want to include you can just add it to your list. – Ryan Epp Dec 28 '13 at 15:56
  • 1
    I find a [GitHub repository](https://github.com/dwyl/english-words) which contains 479k English words. – haolee May 29 '17 at 02:59
16

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words of the English/American dictionary. In the directory /usr/share/dict there is a words file. There are also more specific american-english and british-english files, which contain the words of that particular dialect. The file can be read from any programming language, which is why I thought you might want to know about this.

Now, for Python users specifically, the code below should load every single word into the list words:

import re

# read the system word list and split it into individual words
file = open("/usr/share/dict/words", "r")
words = re.sub(r"[^\w]", " ", file.read()).split()
file.close()

def is_word(word):
    return word.lower() in words

is_word("tarts")            ## Returns True
is_word("jwiefjiojrfiorj")  ## Returns False

Hope this helps!

EDIT: If you can't find the words file or something similar, see the comment from Dr Phil below.

James Ashwood
  • 469
  • 6
  • 13
  • 4
    This is a great answer as it avoids having to install a massive NLP library for this simple task. Only comment is in your example you leave the file open - doing this in a `with open(...)` block would be better (or just adding file.close() after you've loaded the words). – mdmjsh Jan 13 '22 at 11:21
  • 1
    If you don't have a words file by default on your Linux installation (*my Ubuntu 22.04 didn't have it*), then you can run ```sudo apt install wordlist``` to find all the relevant packages. For example I then ran ```sudo apt install wamerican``` to get the American English wordlist installed – Dr Phil Jun 22 '23 at 10:28
9

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# build the lookup table once, outside the function, since it only needs to be done once
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        dictionary[word]
        return True
    except KeyError:
        return False
Guillaume Jacquenot
  • 11,217
  • 6
  • 43
  • 49
Eb Abadi
  • 585
  • 5
  • 17
  • 4
    Instead of a dictionary, use a set – jhuang Jun 20 '18 at 03:30
  • 2 comments - don't rebuild the set or dictionary every call, that will take time. I tested and I find the set version indeed faster, in my work environment 124ms average versus 240ms – user1617979 Aug 16 '22 at 17:59
8

I find that there are three package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadik, and use set(words.words()) to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
True
Tom M.
  • 105
  • 2
Young Yang
  • 134
  • 1
  • 5
  • Before checking your `input_word`, use `input_word.lower()` to convert it to lowercase. Only lowercase words seem to be present in nltk words list. – Bikash Gyawali Jan 19 '21 at 17:23
3

With pyenchant's enchant.checker.SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    # English if the text has at least 3 words and at most 4 unrecognised words
    return len(quote.split()) >= 3 and len(errors) <= 4

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True
grizmin
  • 149
  • 5
  • 1
    This will return true if the text is longer than 3 words and there are less than 4 errors (non-recognised words). In general for my use case those settings work pretty well. – grizmin Aug 02 '19 at 13:54
1

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request, get the results back in JSON format, and parse them with Python's json module. If it's not an English word, you'll get no results.

As another idea, you could query Wiktionary's API.
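
A rough sketch of the Wiktionary idea, using the standard MediaWiki query API (the exact response handling is an assumption; also note that Wiktionary has entries for many languages, so a hit does not guarantee the word is English):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    url = ("https://en.wiktionary.org/w/api.php?action=query&format=json&titles="
           + urllib.parse.quote(word))
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read())
    # the API reports missing pages under the key "-1"
    return "-1" not in data["query"]["pages"]

print(in_wiktionary("hello"))            # True
print(in_wiktionary("jwiefjiojrfiorj"))  # False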

Community
  • 1
  • 1
burkestar
  • 753
  • 1
  • 4
  • 12
0

Use nltk.corpus instead of enchant. Enchant gives ambiguous results. For example: for benchmark and bench-mark, enchant returns True, while it is supposed to return False for bench-mark.

Anand Kuamr
  • 1
  • 1
  • 1
0

Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

then create a set out of it using the following Python code snippet, which loads about 370k alphabetic English words:

>>> with open("/PATH/TO/words_alpha.txt") as f:
...     words = set(f.read().split('\n'))
>>> len(words)
370106

From here onwards, you can check for existence in constant time using

>>> word_to_check = 'baboon'
>>> word_to_check in words
True

Note that this set might not be comprehensive but still gets the job done; the user should do quality checks to make sure it works for their use case as well.

Ayush
  • 479
  • 2
  • 9
  • 24
0

None of the above libraries contains all English words, so I imported a CSV file containing English words from this link: https://github.com/dwyl/english-words

I simply made that into a pandas DataFrame and compared words against it.
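
A sketch of that approach (the file name, words_alpha.txt, and its one-word-per-line layout are assumptions based on the dwyl/english-words repository; a set is built from the DataFrame column so lookups stay fast):

import pandas as pd

# one word per line; keep_default_na=False so words like "null" aren't parsed as missing values
df = pd.read_csv("words_alpha.txt", header=None, names=["word"], keep_default_na=False)
english_words = set(df["word"].str.lower())

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # True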

GeekyPS
  • 11
  • 3