
Why does the NLTK words corpus not contain "okay", "ok", or "Okay"?

>>> from nltk.corpus import words
>>> "check" in words.words()
True

>>> "okay" in words.words()
False

>>> len(words.words())
236736

Any ideas why?

MonsieurBeilto
  • Heh, TIL `zymolysis`, `okupukupu` and a lot of obscure words are in the corpus while `okay` is not. You should probably ask it in [nltk github](https://github.com/nltk/nltk) or use a different corpus but good catch. – umutto Jun 09 '17 at 04:33
  • The details of the contents of some particular corpus is not a programming question, even if someone can answer it, it's fairly clearly off-topic. Seems more sensible to ask the corpus maintainers. – pvg Jun 09 '17 at 04:42
  • I agree with @kindall – Sagar V Jun 22 '17 at 14:46

2 Answers


TL;DR

from nltk.corpus import words
from nltk.corpus import wordnet

# wordnet.words() returns an iterator, so convert it to a list before concatenating
manywords = words.words() + list(wordnet.words())

In Long

According to the docs, nltk.corpus.words is a list of words taken from http://en.wikipedia.org/wiki/Words_(Unix)

On Unix, you can list the word files with:

ls /usr/share/dict/

And reading the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
#   @(#)README  8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier.  The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases.  The wordlist makes a dandy 'grep' victim.

     -- James A. Woods    {ihnp4,hplabs}!ames!jaw    (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.


FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen               typographical error in original file
    freend              archaic spelling no longer in use;
                    masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'.  This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation.  It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

Since it's a fixed list of 234,936 words, there are bound to be words that it does not contain.

If you need to extend your word list, you can add the words from WordNet via nltk.corpus.wordnet.words().
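As a rough sketch of that merge, the sources can be combined into a set, which also makes membership checks faster than scanning a list. The helper name build_vocabulary and the stand-in word lists are made up for illustration; with NLTK installed and the corpora downloaded, you would pass words.words() and wordnet.words() instead.

```python
from typing import Iterable, Set


def build_vocabulary(*sources: Iterable[str]) -> Set[str]:
    """Union several word sources into one lowercase set (hypothetical helper)."""
    vocabulary: Set[str] = set()
    for source in sources:
        # update() accepts lists and one-shot iterators alike
        vocabulary.update(word.lower() for word in source)
    return vocabulary


# Stand-in data, not the real corpora.
unix_words = ["check", "Zymolysis"]
wordnet_words = iter(["okay", "ok", "check"])

vocabulary = build_vocabulary(unix_words, wordnet_words)
print("okay" in vocabulary)  # True
print(len(vocabulary))       # 4
```

Lowercasing is a design choice here, so that "Okay" and "okay" count as the same entry; drop it if case matters for your use.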

Most likely, though, all you need is a large enough corpus of text, e.g. a Wikipedia dump: tokenize it and extract all the unique words.
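A minimal sketch of the tokenize-and-collect step, assuming a simple regex tokenizer is good enough (a real pipeline might use nltk.word_tokenize or similar). The pattern and the function name unique_words are illustrative choices, not anything prescribed by NLTK.

```python
import re
from typing import Set

# Assumption: a word is a run of ASCII letters, optionally with internal apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def unique_words(text: str) -> Set[str]:
    """Return the set of distinct lowercase words found in text."""
    return {token.lower() for token in WORD_PATTERN.findall(text)}


sample = "Okay, it's okay. OK?"
vocabulary = unique_words(sample)
print(sorted(vocabulary))  # three distinct words: it's, ok, okay
```

For a large dump, you would feed the text through unique_words in chunks and union the results rather than loading everything into memory at once.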

alkasm
alvas
  • 'There's a corpus that includes the word you're looking for' doesn't really answer 'why does this corpus not include this word'. Nor is it clear how 'TL;DR' is either applicable or explanatory. – pvg Jun 09 '17 at 06:35
  • @alvas Thanks for the work around. This is what I had already done, along with using a custom dictionary. So three dictionaries in total. Cool explanation – MonsieurBeilto Jun 10 '17 at 08:37
  • from nltk.corpus import wordnet as wn ? – Soumendra Jun 18 '17 at 20:46
  • @drinkcodesleeprepeat caught that too. Just sent an edit in for review on the post. – alkasm Jun 19 '17 at 23:58
  • get this: TypeError: can only concatenate list (not "dict_keyiterator") to list – cs0815 Feb 07 '19 at 14:36

I am unable to comment due to low reputation, but I can offer a couple of things. I've posted a zip file in the related nltk_data issue; it contains a more comprehensive set of words, merged in from Ubuntu 18.04's /usr/share/dict/american-english.

Some basic words are conspicuously missing from the original /usr/share/dict files, such as 'failed' and 'failings'. Unfortunately, using WordNet doesn't really resolve this: it adds 'fail-safe' and several types of failure such as 'equipment_failure' and 'renal_failure', but it doesn't add the basic words. Hopefully the supplied zip file will be of some use.
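If you would rather merge a system dictionary yourself, something along these lines may work. It assumes a plain-text file with one word per line, encoded as UTF-8; the function names and the starting set are hypothetical, and the file will not exist on every system, so the sketch falls back to the base set when it is missing.

```python
from pathlib import Path
from typing import Set


def load_word_file(path: str) -> Set[str]:
    """Read a one-word-per-line file into a set, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def extend_vocabulary(base: Set[str], path: str) -> Set[str]:
    """Return base plus the file's words, or a copy of base if the file is missing."""
    if not Path(path).is_file():
        return set(base)
    return base | load_word_file(path)


# Hypothetical starting set; in practice this could be set(words.words()).
vocabulary = extend_vocabulary({"fail", "failure"}, "/usr/share/dict/american-english")
print("failed" in vocabulary)  # result depends on whether the file is present
```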

Greg Nelson