-3

I need to generate all possible words of Ki lengths given n characters for example:

given

LNDJOBEAWRL

do bear

I cant come up with the word of len 5 but this is the idea

n = 11
k1 = 2
k2 = 4
k3 = 5 

So basically all words of length 2 4 and 5 but without reusing characters. What would be the best way to do it?


My dictionary structure looks likes this:

{
    3: [{u'eit': u' "eit":0'}], 
    5: [{u'doosw': u' "woods": 4601, '}, {u'acenr': u' "caner": 0, '}, {u'acens': u' "canes": 0, '}, {u'acden': u' "caned": 0, '}, {u'aceln': u' "canel": 0,'}], 
    6: [{u'abeill': u' "alible": 0, '}, {u'cdeeit': u' "deciet":0,'}, {u'demoor': u' "mooder": 0, '}], 
    7: [{u'deiprss': u' "spiders": 0, '}, {u'deiprsy': u' "spidery": 0, '}, {u'cersttu': u' "scutter": 0, '}], 
    8: [{u'chiiilst': u' "chilitis": 0, '}, {u'agilnrtw': u' "trawling": 0, '}, {u'abdeemns': u' "beadsmen": 0, '}], 
    9: [{u'abeiilnns': u' "biennials": 0, '}, {u'bclooortu': u' "oblocutor": 0, '}, {u'aabfiinst': u' "fabianist": 0, '}, {u'acdeiituz': u' "diazeutic": 0, '}, {u'aabfiimns': u' "fabianism": 0, '}, {u'ehnoooppt': u' "optophone": 0, '}], 
    10: [{u'aiilnoprtt': u' "tripolitan": 0, '}, {u'eeilprrsty': u' "sperrylite": 0, '}, {u'gghhiilttt': u' "lighttight": 0, '}, {u'aeegilrruz': u' "regularize": 0, '}, {u'ellnprtuuy': u' "purulently": 0, '}], 
    11: [{u'cdgilnoostu': u' "outscolding": 0, '}], 
    12: [{u'ceeeilnostuy': u' "leucosyenite": 0, '}, {u'aacciloprsst': u' "sarcoplastic": 0, '}], 
    13: [{u'acdeimmoprrsu': u' "cardiospermum": 0, '}, {u'celnnooostuvy': u' "noncovetously": 0, '}], 
    14: [{u'adeejmnnoprrtu': u' "preadjournment": 0, '}]
}

And my modified code looks like this:

wlen = self.table[pos]
if pos == 0:
    # See if the letters remaining in the bag are a valid word
    key = ''.join(sorted(bag.elements()))

    for d in wlen:
        if key in d.keys():
            yield solution + [key]
else:
    pos -= 1
    for dic in wlen:
        print(len(dic))
        for key in dic.keys():
PM 2Ring
  • 54,345
  • 6
  • 82
  • 182
Omar Hashmi
  • 35
  • 1
  • 1
  • 6
  • 2
    How are you defining what a word is? Are you using a dictionary? Or are you just looking for combinations of letters? – Mark Jun 28 '18 at 23:20
  • yes i have a dictionary with valid words. I want to lookup all the possible words of length 2 4 5 in my dictionary that can be generated with given characters – Omar Hashmi Jun 28 '18 at 23:29
  • So just read all the words from the dictionary and save the ones that are the correct length, e.g. in a list or set. – PM 2Ring Jun 28 '18 at 23:32
  • yes exactly without reusing any chars – Omar Hashmi Jun 28 '18 at 23:33
  • I think you need to explain your goal more clearly. Do you mean you want to find sets of 3 words, with the 1st word in each set having 2 letters, the 2nd 4 letters, the 3rd 5 letters, with each letter from your base string e.g. `LNDJOBEAWRL` being used exactly once in those 3 words? – PM 2Ring Jun 28 '18 at 23:38
  • yea thats exactly what i want but those chars can be repeated in the words, each char just can't exceed its count in the resultant words. so for example L can be used twice in whatever words I make. Most others can only be used once because there is just one of those chars so if you use it one word, you can't use the same char again in another word – Omar Hashmi Jun 28 '18 at 23:43
  • this is big data problem i think, I might have to use spark rdds to store all the permutations – Omar Hashmi Jun 28 '18 at 23:47
  • Yea please I'd really appreciate your help. I have been thinking about this for 2 days now :( – Omar Hashmi Jun 29 '18 at 00:00
  • I don't understand why that dictionary has those keys. Why does it have letters that aren't in 'LNDJOBEAWRL' ? Please don't make substantial changes to your question after it's already received valid answers. – PM 2Ring Jul 01 '18 at 02:15
  • that's just part of a generic english words dictionary. 'LNDJOBEAWRL' is just an example string to find words. My dictionary has length as key and values as list of dictionary with words of that length. I really need to preserve this structure – Omar Hashmi Jul 01 '18 at 02:25

2 Answers2

0

First thing is standardize the words such that two words that are anagrams of each other will be handled exactly the same. We can do this by converting to lowercase and sorting the letters of the word. The next step is to distinguish between multiple occurrences of a give letter. To do this, we'll map each letter to a symbol containing the letter, and a number representing its occurrence in the string.

target = "LNDJOBEAWRL".lower()
symbols = sorted([c + str(target[i+1:].count(c)) for i, c in enumerate(target)])

Now that we have a standard representation for each word, we need a quick way to check if any permutations will match them. For this we use the trie datastructure. Here is some starter code for one:

class Trie:
    def __init__(self, symbol):
        self.symbol = symbol
        self.words = []
        self.children = dict()

    def add_word(self, word):
        self.words.append(word)

    def add_child(self, symbol, trie):
        self.children[symbol] = trie

Now you need to make an empty trie as the root, with anything as a symbol, dedicated to holding all top-level tries. Then iterate over each word we transformed earlier, and for the first symbol we produced, check if the root trie has a child with that symbol. If it does not, create a trie for it and add it. If it does, proceed to the next symbol, and check if a trie with that symbol is in the previous trie. Proceed in this fashion until you've exhausted all symbols, in which case the current trie node represents the standardized form for that word we transformed. Store the original word in this trie, and proceed to the next word.

Once that's done, your entire word list will be contained in this trie datastructure. Then, you can just do something like:

def print_words(symbols, node):
    for word in node.words:
        print(word)
    for sym in node.children:
        if sym in symbols:
            print_words(symbols, node.children[sym])

print_words(symbols, root_trie)

To print all words that can be composed from the symbols for the target word.

Dillon Davis
  • 6,679
  • 2
  • 15
  • 37
  • What if my dictionary is huge too like 100,000 entries? – Omar Hashmi Jun 29 '18 at 13:32
  • Are you talking about time to construct the trie, or time to query it? Because the time it takes to query is proportional to the total number of characters of all strings matched, in the worst case. In the best case, most matches share many of the same parent nodes, saving time. I'm not sure how much faster you can make a query using a different structure. If you're talking about time to construct- sort the symbol lists generated, and abuse the fact adjacent elements will share a large portion of their trie path to avoid unnecessary traversal in the trie. – Dillon Davis Jun 29 '18 at 15:18
  • sorry I meant to ask that question for the second answer – Omar Hashmi Jun 29 '18 at 15:43
0

The code below uses a recursive generator to build solutions. To store the target letters we use a collections.Counter, this acts like a set that permits repeated items.

To simplify searching we create a dictionary for each word length that we want, storing each of those dictionaries in a dictionary called all_words, with the word length as the key. Each sub-dictionary stores lists of words containing the same letters, with the sorted letters as the key, eg 'aet': ['ate', 'eat', 'tea'].

I use the standard Unix '/usr/share/dict/words' word file. If you use a file with a different format you may need to modify the code that puts words into all_words.

The solve function starts its search with the smallest word length, and works up to the largest. That's probably the most efficient order if the set containing the longest words is the biggest, since the final search is performed by doing a simple dict lookup, which is very fast. The previous searches have to test each word in the sub-dictionary of that length, looking for keys that are still in the target bag.

#!/usr/bin/env python3

''' Create anagrams from a string of target letters and a list of word lengths '''

from collections import Counter
from itertools import product

# The Unix word list
fname = '/usr/share/dict/words'

# The target letters to use
target = 'lndjobeawrl'

# Word lengths, in descending order
wordlengths = [5, 4, 2]

# A dict to hold dicts for each word length.
# The inner dicts store lists of words containing the same letters,
# with the sorted letters as the key, eg 'aet': ['ate', 'eat', 'tea']
all_words = {i: {} for i in wordlengths}

# A method that tests if a word only contains letters in target
valid = set(target).issuperset

print('Scanning', fname, 'for valid words...')
count = 0
with open(fname) as f:
    for word in f:
        word = word.rstrip()
        wlen = len(word)
        # Only add words of the correct length, with no punctuation.
        # Using word.islower() eliminates most abbreviations.
        if (wlen in wordlengths and word.islower()
        and word.isalpha() and valid(word)):
            sorted_word = ''.join(sorted(word))
            # Add this word to the list in all_words[wlen],
            # creating the list if it doesn't exist
            all_words[wlen].setdefault(sorted_word, []).append(word)
            count += 1

print(count, 'words found')
for k, v in all_words.items():
    print(k, len(v))
print('\nSolving...')

def solve(pos, bag, solution):
    wlen = wordlengths[pos]
    if pos == 0:
        # See if the letters remaining in the bag are a valid word
        key = ''.join(sorted(bag.elements()))
        if key in all_words[wlen]:
            yield solution + [key]
    else:
        pos -= 1
        for key in all_words[wlen].keys():
            # Test that all letters in key are in the bag
            newbag = bag.copy()
            newbag.subtract(key)
            if all(v >= 0 for v in newbag.values()):
                # Add this key to the current solution and 
                # recurse to find the next key
                yield from solve(pos, newbag, solution + [key])

# Find all lists of keys that produce valid combinations
for solution in solve(len(wordlengths) - 1, Counter(target), []):
    # Convert solutions to tuples of words
    t = [all_words[len(key)][key] for key in solution]
    for s in product(*t):
        print(s)

output

Scanning /usr/share/dict/words for valid words...
300 words found
5 110
4 112
2 11

Solving...
('ad', 'jell', 'brown')
('do', 'jell', 'brawn')
('ow', 'jell', 'brand')
('re', 'jowl', 'bland')

FWIW, here are the results for

target = 'nobigword'
wordlengths = [4, 3, 2]

output

Scanning /usr/share/dict/words for valid words...
83 words found
4 31
3 33
2 7

Solving...
('do', 'big', 'worn')
('do', 'bin', 'grow')
('do', 'nib', 'grow')
('do', 'bow', 'grin')
('do', 'bow', 'ring')
('do', 'gin', 'brow')
('do', 'now', 'brig')
('do', 'own', 'brig')
('do', 'won', 'brig')
('do', 'orb', 'wing')
('do', 'rob', 'wing')
('do', 'rib', 'gown')
('do', 'wig', 'born')
('go', 'bid', 'worn')
('go', 'bin', 'word')
('go', 'nib', 'word')
('go', 'bow', 'rind')
('go', 'din', 'brow')
('go', 'now', 'bird')
('go', 'own', 'bird')
('go', 'won', 'bird')
('go', 'orb', 'wind')
('go', 'rob', 'wind')
('go', 'rib', 'down')
('go', 'row', 'bind')
('id', 'bog', 'worn')
('id', 'gob', 'worn')
('id', 'orb', 'gown')
('id', 'rob', 'gown')
('id', 'row', 'bong')
('in', 'bog', 'word')
('in', 'gob', 'word')
('in', 'dog', 'brow')
('in', 'god', 'brow')
('no', 'bid', 'grow')
('on', 'bid', 'grow')
('no', 'big', 'word')
('on', 'big', 'word')
('no', 'bow', 'gird')
('no', 'bow', 'grid')
('on', 'bow', 'gird')
('on', 'bow', 'grid')
('no', 'dig', 'brow')
('on', 'dig', 'brow')
('or', 'bid', 'gown')
('or', 'big', 'down')
('or', 'bog', 'wind')
('or', 'gob', 'wind')
('or', 'bow', 'ding')
('or', 'wig', 'bond')
('ow', 'bog', 'rind')
('ow', 'gob', 'rind')
('ow', 'dig', 'born')
('ow', 'don', 'brig')
('ow', 'nod', 'brig')
('ow', 'orb', 'ding')
('ow', 'rob', 'ding')
('ow', 'rid', 'bong')
('ow', 'rig', 'bond')

This code was written for Python 3. You can use it on Python 2.7 but you will need to change

yield from solve(pos, newbag, solution + [key])

to

for result in solve(pos, newbag, solution + [key]):
    yield result
PM 2Ring
  • 54,345
  • 6
  • 82
  • 182
  • Thank you that would do it! Thanks so much!! – Omar Hashmi Jun 30 '18 at 00:33
  • Hi i tried this solution but i m getting syntax error for yeild from because i m using python2.7. What would be the equivalent of that in python in 2.7. I have never used yeild – Omar Hashmi Jun 30 '18 at 21:40
  • @OmarHashmi To understand how this code works you need to be familiar with [recursion](https://en.wikipedia.org/wiki/Recursion_%28computer_science%29) and with Python [generators](https://docs.python.org/3/tutorial/classes.html#generators). You may also find this page helpful: [Understanding Generators in Python](https://stackoverflow.com/q/1756096/4014959); there are also various Python generators tutorials available online. – PM 2Ring Jul 01 '18 at 00:41
  • also I am confused why are you passing len(wordlengths) - 1 in solve. Shouldnt it be the elements in the wordlengths? My dictionary is a little bit different I changed your code to get the right elements but I am still getting 0 solutions, it looks like this {1: [{u'{': u'{'}, {u'}': u'}'}], 3: [{u'eit': u' "eit":0'}], 5: [{u'doosw': u' "woods": 4601, '}, {u'acenr': u' "caner": 0, '}, {u'acens': u' "canes": 0, '}, {u'acden': u' "caned": 0, '}, {u'aceln': u' "canel": 0,'}]}. – Omar Hashmi Jul 01 '18 at 00:50
  • @OmarHashmi The 1st arg to the `solve` function is `pos`, which is the index of an element in `wordlengths`. Then `solve` accesses that element with `wlen = wordlengths[pos]`. But it needs `pos` because it needs to know when `pos == 0`. There's something wrong with your code that creates the `all_words` dictionary. The inner dictionary keys should just be strings of letters, and the values should be lists of words with those letters, eg `'aet': ['ate', 'eat', 'tea']`. If you add a small sample of your word file to the end of your question I'll write some code that reads it correctly. – PM 2Ring Jul 01 '18 at 01:17
  • I kind of need that structure for something else my table looks like this {len(word): [sorted(word): {actualword:popularitynumber]} e.g {5: [{'doosw': "woods": 4601}, {acenr: "caner": 0}} can you help me with that structure of the table? we dont have to worry about popularity number – Omar Hashmi Jul 01 '18 at 01:45