
Generate all possible combinations of English words from a given string in Python.

Input: godaddy Output: go, god, dad, add, daddy

Any good libraries?

Austin p.b

3 Answers


Try enchant from http://pythonhosted.org/pyenchant/tutorial.html

>>> from nltk import everygrams
>>> import enchant
>>> word = 'godaddy'
>>> d = enchant.Dict("en_US")
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram))]
['g', 'o', 'd', 'a', 'd', 'd', 'y', 'go', 'ad', 'god', 'dad', 'add', 'daddy']
>>> # Exclude single-character words.
>>> [''.join(_ngram) for _ngram in everygrams(word) if d.check(''.join(_ngram)) and len(_ngram) > 1]
['go', 'ad', 'god', 'dad', 'add', 'daddy']

But if you want all contiguous substrings, regardless of whether they are valid English words:

>>> list(everygrams(word))
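If nltk is not available, the same contiguous substrings can be produced with plain Python (a stdlib-only sketch; note that everygrams yields tuples of characters, whereas this yields joined strings):

```python
def all_substrings(word):
    """Yield every contiguous substring of word, shortest first."""
    for length in range(1, len(word) + 1):
        for start in range(len(word) - length + 1):
            yield word[start:start + length]

print(list(all_substrings("god")))  # ['g', 'o', 'd', 'go', 'od', 'god']
```

Each of these substrings can then be checked against a dictionary, exactly as in the enchant example above.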

Note

Any dictionary-checking method has its limitations:

>>> from nltk.corpus import words as english
>>> vocab = set(w.lower() for w in english.words())
>>> "google" in vocab
False
>>> "stackoverflow" in vocab
False

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check('StackOverflow')
False
>>> d.check('Stackoverflow')
False
>>> d.check('Google')
True

The "principled" way to do this task is to build a character-level language model and use it to estimate how probable a given sequence of characters is as an English word.
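As a toy illustration of that idea (stdlib only; the corpus, the `char_bigrams` helper, and the scoring scheme are all made up for this sketch, and a real model would be trained on far more text), one could score candidate strings by their smoothed character-bigram frequency:

```python
from collections import Counter
from math import log

def char_bigrams(text):
    """Character bigrams with start (^) and end ($) markers."""
    padded = f"^{text}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

# Toy corpus standing in for real training text.
corpus = ["go", "god", "dad", "add", "daddy", "good", "day"]
counts = Counter(bg for w in corpus for bg in char_bigrams(w))
total = sum(counts.values())

def log_prob(word):
    """Sum of smoothed log bigram frequencies; higher = more English-like.

    This is a frequency score, not a true conditional bigram model,
    but it suffices to illustrate the idea.
    """
    vocab = len(counts) + 1  # add-one smoothing denominator
    return sum(log((counts[bg] + 1) / (total + vocab))
               for bg in char_bigrams(word))
```

Under this score, English-like strings such as "dad" come out more probable than strings of rare bigrams such as "xqz", which a plain dictionary lookup cannot express.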

Also, there are many Englishes in the world. A word that is valid in British English could be unknown in American English. See http://www.ucl.ac.uk/english-usage/projects/ice.htm and https://en.wikipedia.org/wiki/World_Englishes#Classification_of_Englishes

alvas

You can use nltk.corpus.words to build a set of all English words, then intersect it with the set of all substrings generated from your string:

In [55]: import nltk

In [56]: st = 'godaddy'

In [57]: all_words = {st[i:j + i] for j in range(2, len(st)) for i in range(len(st) - j + 1)}

In [58]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())

In [59]: english_vocab.intersection(all_words)
Out[59]: {'ad', 'add', 'addy', 'da', 'dad', 'daddy', 'go', 'god', 'od', 'oda'}

Note that words like od or oda appear because they are listed as valid abbreviations in that word list.

Mazdak

First, get a set of all English words. There are many libraries that can provide one, but recommendations for software libraries are off-topic for Stack Overflow, so just use whatever you can find.

Then iterate through all substrings of the string and check which of them are in that set:

words = ...  # a set of English words; fill in with the word list of your choice
s = "godaddy"
for i in range(len(s)):
    for j in range(i + 1, len(s)):
        substring = s[i:j + 1]
        if substring in words:
            print(substring)

Result:

go
god
od
oda
da
dad
daddy
ad
add
Kevin