
I am looking for a proper solution to this question. It has been asked many times before, and I haven't found a single answer that suited my needs. I need to use a corpus in NLTK to detect whether a word is an English word.

I have tried:

from nltk.corpus import wordnet
wordnet.synsets(word)

This doesn't work for many common words. Using a list of English words and performing lookups in a file is not an option, and neither is enchant. If there is another library that can do the same, please show how its API is used. If not, please point to a corpus in NLTK that has all the words in English.

akshitBhatia

3 Answers


NLTK includes some corpora that are nothing more than word lists. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or mis-spelt words in a text corpus, as shown below:

import nltk

def unusual_words(text):
    # Lowercase every alphabetic token in the input text
    text_vocab = set(w.lower() for w in text.split() if w.isalpha())
    # Lowercase every entry in the NLTK Words Corpus
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # Keep only the tokens the word list does not know
    unusual = text_vocab - english_vocab
    return sorted(unusual)
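
For illustration, a minimal usage sketch (the sample sentence is made up, and it assumes the Words Corpus has been fetched once with nltk.download('words')):

>>> unusual_words("This sentense has one mispelled word")
['mispelled', 'sentense']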

In this case you can check the membership of your word in english_vocab:

>>> import nltk
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> 'a' in english_vocab
True
>>> 'this' in english_vocab
True
>>> 'nothing' in english_vocab
True
>>> 'nothingg' in english_vocab
False
>>> 'corpus' in english_vocab
True
>>> 'Terminology'.lower() in english_vocab
True
>>> 'sorted' in english_vocab
True
Mazdak
  • Would you know a way of doing this faster? It takes a lot of time for each verification. – Nico Coallier May 11 '17 at 20:01
  • @NicoCoallier Are you using the `set`-based approach? – Mazdak May 11 '17 at 20:02
  • I am trying to identify English sentences: http://stackoverflow.com/questions/43922087/define-if-post-extract-from-a-bilingual-facebook-page-are-in-english-using-pytho – Nico Coallier May 11 '17 at 20:03
  • @NicoCoallier Did you use the `unusual_words` function? I updated it, since the `text` argument needs to be split so the operations run on words rather than characters :-). You can use the updated version. – Mazdak May 11 '17 at 20:05
  • Cheers for that, I'll try it! Do you have an idea to solve my post? – Nico Coallier May 11 '17 at 20:06
  • @NicoCoallier As other people mentioned, you should ask on Code Review if you are looking for an improvement, or at least find which part is the bottleneck of your code and ask about that part separately. – Mazdak May 11 '17 at 20:09
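
Regarding the speed question in the thread above: `token in words.words()` scans a plain list on every check, while membership in a precomputed set is a hash lookup. A minimal sketch of the difference (the token list is made up for illustration):

from nltk.corpus import words

# Build the set once up front; `in` on a set is a constant-time hash lookup,
# whereas `in` on the raw list returned by words.words() scans it linearly.
english_vocab = set(w.lower() for w in words.words())

tokens = ["nothing", "nothingg", "corpus"]
unknown = [t for t in tokens if t.lower() not in english_vocab]
print(unknown)  # ['nothingg']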

I tried the above approach, but it failed for many words that should exist, so I tried WordNet instead. I think it has a more comprehensive vocabulary:

from nltk.corpus import wordnet

if wordnet.synsets(word):
    pass  # Do something
else:
    pass  # Do some other thing
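
A quick REPL sketch of how this behaves (assuming WordNet has been fetched with nltk.download('wordnet'); as the last comment on this page notes, WordNet only covers open-class words, so even valid function words come back empty):

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('nothingg')   # misspelling: no synsets
[]
>>> bool(wordnet.synsets('dog'))  # common noun: has synsets
True
>>> wordnet.synsets('the')        # determiner: also empty, despite being valid English
[]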


Based on my experience, I found two options with NLTK:

1:

from nltk.corpus import words

unknown_word = []

# `token` is the word being checked; words.words() is a plain list
if token not in words.words():
    unknown_word.append(token)

2:

from nltk.corpus import wordnet

unknown_word = []

# A token with no synsets is not in WordNet's vocabulary
if len(wordnet.synsets(token)) == 0:
    unknown_word.append(token)

Performance of option 2 is better, and more relevant words get captured with it.

I would recommend going with option 2.
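
As an illustrative sketch, option 2 applied over a whole token list (the tokens are made up):

from nltk.corpus import wordnet

tokens = ["nothing", "nothingg", "corpus", "xyzzyx"]  # made-up sample input
# Keep tokens that WordNet has no synsets for
unknown_word = [t for t in tokens if not wordnet.synsets(t)]
print(unknown_word)  # ['nothingg', 'xyzzyx']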

General Grievance
  • For method 2 (WordNet): many common valid words (of, an, the, and, about, above, because, etc.) will be categorised as unknown by this method, because "WordNet only contains 'open-class words': nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles." See https://wordnet.princeton.edu/frequently-asked-questions – MattG Nov 27 '21 at 11:16
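
Given that caveat, one possible workaround (a sketch combining the two corpora used above, not taken from any answer here) is to accept a token that appears in either the Words Corpus or WordNet:

from nltk.corpus import words, wordnet

# Build the word-list set once for fast lookups
english_vocab = set(w.lower() for w in words.words())

def is_english_word(token):
    # True if the token is in the Unix word list or has any WordNet synsets
    return token.lower() in english_vocab or bool(wordnet.synsets(token))

print(is_english_word('because'))   # True: in the word list, though absent from WordNet
print(is_english_word('nothingg'))  # False: in neither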