detect English words and nltk's words corpus

Question

Just trying to see if a word is English or not. This:

import nltk

english_words = set(nltk.corpus.words.words())
print("revised" in english_words)

results in False. Am I doing something wrong? Is this to be expected? Are there better ways of doing this? Thanks.

Dumb question, but did you check if the word revised is actually in the corpus? — InfiniteHigh, Feb 07 '19 at 13:45
Possible duplicate of [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) — Amit Gupta, Feb 07 '19 at 13:48
I found a similar question [here](https://stackoverflow.com/questions/44449284/nltk-words-corpus-does-not-contain-okay) — InfiniteHigh, Feb 07 '19 at 13:50

adrianus · Accepted Answer · 2019-02-07T14:10:02.127

2

It seems that "revised" is indeed not in the word list:

import nltk

english_words = set(nltk.corpus.words.words())

for w in english_words:
    if w.startswith("revise"):
        print(w)

prints the following list:

reviser
revise
revisee
revisership

According to this source (section 4.1), the word list originates from:

The Words Corpus is the /usr/share/dict/words file from Unix

So you'll have to decide whether the word list NLTK provides is enough for your use case, or whether you want to switch to a more complete (and larger) one.
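
If the NLTK list turns out to be too small, one option is to merge it with a second list. Here is a minimal sketch, assuming the extra list is a plain-text file with one word per line. `load_word_list` and `merge_vocabularies` are hypothetical helper names for illustration, not NLTK functions.

```python
from pathlib import Path
from typing import Iterable, Set


def load_word_list(path: str) -> Set[str]:
    """Read one word per line, skip blank lines, and lowercase everything."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def merge_vocabularies(*word_lists: Iterable[str]) -> Set[str]:
    """Combine any number of word collections into one lowercase set."""
    merged: Set[str] = set()
    for words in word_lists:
        merged.update(w.lower() for w in words)
    return merged
```

For example, merge_vocabularies(nltk.corpus.words.words(), load_word_list("/usr/share/dict/words")) would combine NLTK's list with the local system dictionary, provided that file exists on your machine.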

edited Feb 07 '19 at 14:10

answered Feb 07 '19 at 13:52


1

I think you should also lemmatize your word, then you'll find it in the English words – Amir Imani Feb 07 '19 at 14:42
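
The lemmatization idea from the comment above could look roughly like this. It is a sketch, not a tested recipe: `is_known_word` and `load_nltk_resources` are hypothetical helper names, and the NLTK part assumes the `words` and `wordnet` data packages have already been downloaded. The lookup tries the word as-is first, then its lemma for each part of speech, because `WordNetLemmatizer.lemmatize` treats a word as a noun unless told otherwise.

```python
from typing import Callable, Iterable, Set


def is_known_word(
    word: str,
    vocabulary: Set[str],
    lemmatize: Callable[[str, str], str],
    pos_tags: Iterable[str] = ("n", "v", "a", "r"),
) -> bool:
    """Return True if the word, or one of its lemmas, is in the vocabulary."""
    candidate = word.lower()
    if candidate in vocabulary:
        return True
    # The lemma depends on the part of speech, so try each tag in turn.
    return any(lemmatize(candidate, pos) in vocabulary for pos in pos_tags)


def load_nltk_resources():
    """Build a lowercase vocabulary and a lemmatize function from NLTK."""
    # Imported lazily so the lookup above can be used without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    vocabulary = {w.lower() for w in nltk.corpus.words.words()}
    return vocabulary, WordNetLemmatizer().lemmatize
```

Usage would be along the lines of vocabulary, lemmatize = load_nltk_resources() followed by is_known_word("revised", vocabulary, lemmatize).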

score 1 · Answer 2 · answered Feb 07 '19 at 13:47


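Try this:

from nltk.corpus import wordnet

word_to_test = "revised"

# A word with no WordNet synsets is treated as not English
if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")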


Amit Gupta
