Generate all possible combinations of English words from a given string in Python.
Input: godaddy Output: go, god, dad, add, daddy
Any good libraries?
Try enchant
from http://pythonhosted.org/pyenchant/tutorial.html
>>> import enchant
>>> from nltk import everygrams
>>> word = 'godaddy'
>>> d = enchant.Dict("en_US")
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram))]
['g', 'o', 'd', 'a', 'd', 'd', 'y', 'go', 'ad', 'god', 'dad', 'add', 'daddy']
>>> # Exclude single-character words.
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram)) and len(_ngram) > 1]
['go', 'ad', 'god', 'dad', 'add', 'daddy']
But if you want all substrings, regardless of whether they are valid English words:
>>> list(everygrams(word))
Any dictionary-checking method has its limitations:
>>> from nltk.corpus import words as english
>>> vocab = set(w.lower() for w in english.words())
>>> "google" in vocab
False
>>> "stackoverflow" in vocab
False
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check('StackOverflow')
False
>>> d.check('Stackoverflow')
False
>>> d.check('Google')
True
The "principled" way to do this task is to do language modeling at the character level and use some probabilistic measure of whether a sequence of characters is more or less probable as an English word.
Also, there are many Englishes in the world. A "valid" word in British English could be an unknown word in American English. See http://www.ucl.ac.uk/english-usage/projects/ice.htm and https://en.wikipedia.org/wiki/World_Englishes#Classification_of_Englishes
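A minimal sketch of that character-level idea: train a bigram model on a tiny hardcoded word list (an assumption for illustration; a real model would be trained on a large corpus) and compare the scores of an English-like string and a random one.

```python
import math
from collections import defaultdict

# Tiny illustrative training list -- a real model needs a large corpus.
training_words = ["go", "god", "dad", "add", "daddy", "good", "day", "ago"]

# Count character bigrams, padding each word with start/end markers.
counts = defaultdict(lambda: defaultdict(int))
for w in training_words:
    padded = "^" + w + "$"
    for a, b in zip(padded, padded[1:]):
        counts[a][b] += 1

def log_prob(word):
    """Log-probability of a word under the bigram model (add-one smoothing)."""
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        denom = sum(counts[a].values()) + 27  # crude smoothing over ~27 symbols
        total += math.log((counts[a][b] + 1) / denom)
    return total

# A string with familiar character transitions scores higher than a random one.
print(log_prob("dad") > log_prob("qzx"))  # True
```

This toy model only looks at adjacent character pairs; real systems use longer n-grams or neural models, but the scoring principle is the same.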
You can use nltk.corpus.words to create a set of all English words, then find the intersection of all possible substrings of your string with that set:
In [54]: import nltk

In [55]: st = "godaddy"

In [56]: all_words = {st[i:j + i] for j in range(2, len(st)) for i in range(len(st) - j + 1)}
In [57]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [58]: english_vocab.intersection(all_words)
Out[58]: {'ad', 'add', 'addy', 'da', 'dad', 'daddy', 'go', 'god', 'od', 'oda'}
Note that words like od or oda are valid abbreviations in this corpus.
First, get a set of all English words. I expect there are many libraries that can do this, but recommendations for software libraries are off-topic for Stack Overflow, so just use whatever you can find.
Then, iterate through all substrings of the string, and see if any of them are in the collection.
words = #???
s = "godaddy"
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        substring = s[i:j + 1]
        if substring in words:
            print(substring)
Result:
go
god
od
oda
da
dad
daddy
ad
add
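A self-contained version of the loop above, collecting matches into a list instead of printing them. A small hardcoded set stands in for a real English dictionary here (an assumption for demonstration; in practice you would load a full word list):

```python
# Stand-in dictionary -- replace with a real English word set in practice.
words = {"go", "god", "od", "oda", "da", "dad", "daddy", "ad", "add"}

s = "godaddy"
found = []
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        substring = s[i:j + 1]  # substrings of length >= 2
        if substring in words:
            found.append(substring)

print(found)
# → ['go', 'god', 'od', 'oda', 'da', 'dad', 'daddy', 'ad', 'add']
```

Collecting into a list makes the result easy to deduplicate or sort afterwards, which matters once the dictionary is large.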