3

I know variations of this have been asked before, but I was unable to understand any of the previous implementations because most of them involved using sets and the issubset method.

Here is what I am trying to do: I have a set of words in a dictionary and a list of possible letters. I want to find if the members of the set can be formed through re-arranging the letters in the list. Here is my current implementation:

def solve(dictionary, letters):

    for word in dictionary: #for each word in the dictionary
        if len(word) > len(letters):   # first optimization, doesn't check words that are larger than letter set
            continue
        else: 
            scrambledword = "".join([b for b in sorted(list(word))]) #sorts the letters in each word
            if set(scrambledword).issubset(letters):
                print word


def main():
    dictionary = set([x.strip() for x in open("C:\\Python27\\dictionary.txt")])        
    letters = sorted(['v','r','o','o','m','a','b','c','d'])
    solve(dictionary, letters)


main()

The obvious problem with this implementation is that some words will be found that use more than one letter in "letters." For example, the word 'cardboard' appears as a valid word, despite there being only one copy of 'a' and 'r' in the letters list. How do I use the "issubset" method on lists?

ninjagecko
  • 88,546
  • 24
  • 137
  • 145
Parseltongue
  • 11,157
  • 30
  • 95
  • 160
  • Wouldn't 'cardboard' be filtered out by `if len(word) > len(letters):`? – senderle Jun 13 '11 at 04:33
  • No. Because the length of the word is equal to the length of the letter list. – Parseltongue Jun 13 '11 at 04:38
  • 1
    These **aren't anagrams**, which may explain some of the difficulty in searching for a solution. (Also, the system-wide Python installation folder is a **really bad** place to put your word list.) – Karl Knechtel Aug 01 '22 at 21:53

3 Answers3

5

To know if you can make a word out of a set of letters [oops, I did it myself -- I meant 'collection'!], you want every letter to occur at least the right number of times, so I think we're going to have to work the counts in there somehow. By definition, Python sets don't care about the number of elements in a source list. Maybe something like

from collections import Counter

letters = ['v','r','o','o','m','a','b','c','d']
words = 'cardboard boom booom'.split()
letterscount = Counter(letters)

for word in words:
    wordcount = Counter(word)
    print word, all(letterscount[c] >= wordcount[c] for c in wordcount)

giving

cardboard False
boom True
booom False

Counter is a handy utility class:

>>> c = Counter(letters)
>>> c
Counter({'o': 2, 'a': 1, 'c': 1, 'b': 1, 'd': 1, 'm': 1, 'r': 1, 'v': 1})
>>> c['o']
2
>>> c['z']
0

[DSM: the return! I removed a community edit which doesn't work because Counter instances aren't hashable.]

If searching speed is a concern, then you can trade off memory and precomputation time:

from collections import defaultdict, Counter
from itertools import combinations

# precomputations
allwords = open('/usr/share/dict/words').read().split() 
allwords = list(w for w in allwords if len(w) >= 3) # hack, /words contains lots of silliness
allwords_by_count = defaultdict(list)
for i, word in enumerate(allwords):
    allwords_by_count[frozenset(word)].append((word, Counter(word)))
    if i % 1000 == 0:
        print i, word


def wordsfrom(letters, words_by_count):
    lettercount = Counter(letters)
    for subsetsize in range(1, len(lettercount)+1):
        for subset in combinations(lettercount, subsetsize):
            for possword, posswordcount in words_by_count[frozenset(subset)]:
                if all(posswordcount[c] <= lettercount[c] for c in posswordcount):
                    yield possword

>>> wordsfrom('thistles', allwords_by_count)
<generator object wordsfrom at 0x1032956e0>
>>> list(wordsfrom('thistles', allwords_by_count))
['ess', 'sis', 'tit', 'tst', 'hei', 'hie', 'lei', 'lie', 'sie', 'sise', 'tie', 'tite', 'she', 'het', 'teth', 'the', 'els', 'less', 'elt', 'let', 'telt', 'set', 'sett', 'stet', 'test', 'his', 'hiss', 'shi', 'sish', 'hit', 'lis', 'liss', 'sil', 'lit', 'til', 'tilt', 'ist', 'its', 'sist', 'sit', 'shies', 'tithe', 'isle', 'sile', 'sisel', 'lite', 'teil', 'teli', 'tile', 'title', 'seit', 'sesti', 'site', 'stite', 'testis', 'hest', 'seth', 'lest', 'selt', 'lish', 'slish', 'hilt', 'lith', 'tilth', 'hist', 'sith', 'stith', 'this', 'list', 'silt', 'slit', 'stilt', 'liesh', 'shiel', 'lithe', 'shiest', 'sithe', 'theist', 'thesis', 'islet', 'istle', 'sistle', 'slite', 'stile', 'stilet', 'hitless', 'tehsil', 'thistle']

[Heh. I just noticed that 'thistles' itself isn't in the list, but that's because it's not in the words file..]

And yes, the apparent "nonwords" there are really in the words file:

>>> assert all(w in allwords for w in (wordsfrom('thistles', allwords_by_count)))
>>> 
DSM
  • 342,061
  • 65
  • 592
  • 494
  • It works well, but is prohibitively slow on very large lists. Let me make sure that I get how it works: so the loop will run through all letters in the word, insuring that every the frequency of letters in the "letter list" is greater than or equal to the frequency of letters in the actual word? Pretty smart implementation. How would I go about printing only the maximum lettered word? I want to use the "max()" function, but I don't know how it would work. – Parseltongue Jun 13 '11 at 04:55
  • I may have to take back the comment about it being "prohibitively slow", as I've now tested out previous implementations that I saw on other threads, and they are all exponentially slower. Kudos; but I'm still looking for any optimizations to this. – Parseltongue Jun 13 '11 at 05:02
  • @NinjaGecko: Error- line 24, in countsToWords.setdefault(key,{}).add(word) TypeError: unhashable type: 'Counter' – Parseltongue Jun 13 '11 at 05:24
  • 1
    See also: https://stackoverflow.com/questions/5485654/ – Karl Knechtel Aug 01 '22 at 22:03
1

If you look for anagrams, in other words you want to rearrange, but use all of them (as opposed to use only a subset) then there is another solution.

You first preprocess all the words in the dictionary. Given a word, you produce the word written with the same letters but in alphabetical order:

def alphabetize(word):
    "".join(sorted(word))

and put those new words in a set newDictionary Then your function can call alphabetize on letters and check whether the result in in the dictionary.

def solve(newDictionary, letters):
    query = alphabetize(letters)
    return query in newDictionary

The alphabetize function is a characteristic of anagrams: two words are anagrams of each other if and only if they produce the same result when alphabetize is applied to them.

Petar Ivanov
  • 91,536
  • 11
  • 82
  • 95
0

Importing collections, we define a hashable multiset:

def Letters(x):
    return frozenset(Counter(x).items())

We now preprocess the vocabulary into a dict of letters->{anagram1,anagram2,...}:

vocabulary = ['apple', 'banana', 'rats', 'star', 'tars']
countsToWords = defaultdict(lambda: set())
for word in vocabulary:
    countsToWords[Letters(word)].add(word)

Your 'solve' function now takes O(1) time:

def solve(query):
    return countsToWords[Letters(query)]

Example:

print( solve('star') )
# {'tars', 'star', 'rats'} 
ninjagecko
  • 88,546
  • 24
  • 137
  • 145
  • This implementation is very fast, but it will not give you words that have a length lower than the length of the string. For example solve('rabbit') will yield nothing – Parseltongue Jun 14 '11 at 03:01