Generate all possible combinations of English words from a given string in Python.
Input: godaddy Output: go, god, dad, add, daddy
Any good libraries?
Try enchant
from http://pythonhosted.org/pyenchant/tutorial.html
>>> import enchant
>>> from nltk import everygrams
>>> word = 'godaddy'
>>> d = enchant.Dict("en_US")
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram))]
['g', 'o', 'd', 'a', 'd', 'd', 'y', 'go', 'ad', 'god', 'dad', 'add', 'daddy']
>>> # Exclude single-character words.
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram)) and len(_ngram) > 1]
['go', 'ad', 'god', 'dad', 'add', 'daddy']
But if you want all substrings, regardless of whether they are valid English words:
>>> list(everygrams(word))
Any dictionary-checking method has its limitations:
>>> from nltk.corpus import words as english
>>> vocab = set(w.lower() for w in english.words())
>>> "google" in vocab
False
>>> "stackoverflow" in vocab
False
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check('StackOverflow')
False
>>> d.check('Stackoverflow')
False
>>> d.check('Google')
True
The "principled" way to do this task is to do language modeling at the character level and use some probabilistic measure of whether a sequence of characters is more or less probable as an English word.
Also, there are many Englishes in the world. A "valid" word in British English could be an unknown word in American English. See http://www.ucl.ac.uk/english-usage/projects/ice.htm and https://en.wikipedia.org/wiki/World_Englishes#Classification_of_Englishes
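A minimal sketch of that character-level idea: train a bigram model on a tiny hardcoded word list (an assumption for illustration; a real model would be trained on a large corpus) and compare the scores of an English-like string and a random one.

```python
import math
from collections import defaultdict

# Tiny illustrative training list -- a real model needs a large corpus.
training_words = ["go", "god", "dad", "add", "daddy", "good", "day", "ago"]

# Count character bigrams, padding each word with start/end markers.
counts = defaultdict(lambda: defaultdict(int))
for w in training_words:
    padded = "^" + w + "$"
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1

def log_prob(word):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        denom = sum(counts[a].values()) + 27  # crude smoothing over ~27 symbols
        total += math.log((counts[a][b] + 1) / denom)
    return total

# A string with familiar character transitions scores higher than a random one.
print(log_prob("dad") > log_prob("qzx"))  # True
```

This toy model only looks at adjacent character pairs; real systems use longer n-grams or neural models, but the scoring principle is the same.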
You can use nltk.corpus.words to create a set of all English words, then find the intersection of all possible substrings of your string with that set:
In [54]: import nltk

In [55]: st = "godaddy"

In [56]: all_words = {st[i:j + i] for j in range(2, len(st)) for i in range(len(st) - j + 1)}
In [57]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [58]: english_vocab.intersection(all_words)
Out[58]: {'ad', 'add', 'addy', 'da', 'dad', 'daddy', 'go', 'god', 'od', 'oda'}
Note that words like od or oda are valid abbreviations in this corpus.
First, get a set of all English words. I expect there are many libraries that can do this, but recommendations for software libraries are off-topic for Stack Overflow, so just use whatever you can find.
Then, iterate through all substrings of the string, and see if any of them are in the collection.
words = #???
s = "godaddy"
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        substring = s[i:j + 1]
        if substring in words:
            print(substring)
Result:
go
god
od
oda
da
dad
daddy
ad
add
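A self-contained version of the loop above, collecting matches into a list instead of printing them. A small hardcoded set stands in for a real English dictionary here (an assumption for demonstration; in practice you would load a full word list):

```python
# Stand-in dictionary -- replace with a real English word set in practice.
words = {"go", "god", "od", "oda", "da", "dad", "daddy", "ad", "add"}

s = "godaddy"
found = []
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        substring = s[i:j + 1]  # substrings of length >= 2
        if substring in words:
            found.append(substring)

print(found)
# → ['go', 'god', 'od', 'oda', 'da', 'dad', 'daddy', 'ad', 'add']
```

Collecting into a list makes the result easy to deduplicate or sort afterwards, which matters once the dictionary is large.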