25

I have no real need to improve it, it's just for fun. Right now it's taking about a second on a list of about 200K words.

I've tried to optimize it as much as I know how (using generators instead of list comprehensions made a big difference), and I've run out of ideas.

Do you have any?

#!/usr/bin/env python
# let's cheat at scrabble

def count_letters(word):
  count = {} 
  for letter in word:
    if letter not in count: count[letter] = 0
    count[letter] += 1 
  return count 

def spellable(word, rack):
    word_count = count_letters(word)
    rack_count  = count_letters(rack)
    return all( [word_count[letter] <= rack_count[letter] for letter in word] )  

score = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
         "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
         "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
         "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
         "x": 8, "z": 10}

def score_word(word):
  return sum([score[c] for c in word])

def word_reader(filename):
  # returns an iterator
  return (word.strip() for word in  open(filename)) 

if __name__ == "__main__":
  import sys
  if len(sys.argv) == 2: 
    rack = sys.argv[1].strip()
  else:
    print """Usage: python cheat_at_scrabble.py <yourrack>"""
    exit()

  words = word_reader('/usr/share/dict/words')
  scored =  ((score_word(word), word) for word in words if set(word).issubset(set(rack)) and len(word) > 1 and spellable(word, rack))

  for score, word in sorted(scored):
    print str(score), '\t', word
casperOne
    You've missed a list comprehension at the end of `spellable`. And since that's in all, it could make a key difference: `all` with a generator expression will stop as soon as it finds one that's not true. – Thomas K Mar 30 '11 at 11:40
  • 5
    Did you do profiling on your code? – neurino Mar 30 '11 at 11:41
  • 5
    `rack_count = count_letters(rack)` - you are calculating this for every `word`. Try to calculate it once before and then reuse. – eumiro Mar 30 '11 at 11:50
  • 1
  • You may prepare your data better. Store some additional info with every word, for instance its letter count. – pajton Mar 30 '11 at 12:00
  • Hi all, thanks for the suggestions. I made the two suggested changes and saw a slight improvement (a tenth of a second or so). I'm ashamed to admit that I've never learned to profile Python, I will search for information on how to do that. –  Mar 30 '11 at 12:01
  • For profiling under python2 you might take a look on the hotshot module. – resi Mar 30 '11 at 13:32
  • http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script will help with profiling. – Stephen Paulger Mar 30 '11 at 14:04
  • Just curious, are you including letters already on the board, or just the 7 letters in the rack? – Gilbert Le Blanc Mar 30 '11 at 14:06
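For the profiling the comments suggest, a minimal `cProfile` run looks like this (a sketch; `work` is a stand-in, you'd profile the real search instead):

```python
import cProfile
import pstats

def work():
    # stand-in for the scrabble search; profile your real main logic instead
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
# stats.print_stats(10)  # show the 10 most expensive calls
```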

5 Answers

17

Without going too far from your basic code, here are some fairly simple optimizations:

First, change your word reader to be:

def word_reader(filename, L):
  L2 = L+2
  # returns an iterator
  return (word.strip() for word in open(filename) \
          if len(word) < L2 and len(word) > 2)

and call it as

words = word_reader('/usr/share/dict/words', len(rack))

This gives the biggest improvement of all of my suggested changes. It eliminates words that are too long or too short before we get too far into the process. Remember that `word` still has its newline character in my comparisons; I assumed `'\n'` line separators. There could also be a problem with the last word in the list because it probably won't have a newline at the end, but on my computer the last word is études, which won't be found by our method anyway. Of course, you could just create your own dictionary beforehand from the original, removing the entries that aren't valid: those that aren't the right length or that have letters outside of a-z.

Next, Ferran suggested a variable for the rack set, which is a good idea. However, you are also getting a pretty major slow down from making a set out of every word. The purpose of using the sets at all was to weed out a lot of the ones that don't have any shot at all and thereby give a speed-up. However, I found it was even faster to just check if the first letter of the word was in the rack before calling spellable:

rackset = frozenset(rack)
scored =  [(score_word(word), word) for word in words if word[0] in rackset \
           and spellable(word, rack)]

However, this has to be accompanied by a change to spellable. I changed it to the following:

def spellable(word, rack):
    return all( [rack.count(letter) >= word.count(letter) \
                 for letter in set(word)] )

which, even without the changes in the previous step, is faster than what you currently have.

With the three changes above, the code was about 3x faster from my simple tests.

On to a better algorithm

Since what you are really doing is looking for anagrams, it makes sense to use an anagram dictionary. An anagram dictionary takes each word in a dictionary and groups the words that are anagrams of each other. For instance, 'takes' and 'skate' are anagrams of each other because both equal 'aekst' when sorted. I created an anagram dictionary as a text file where each line constitutes an entry: the sorted version of the anagrams, followed by the anagrams themselves. For the example above, the entry would be

aekst skate takes

Then I can just take combinations of the rack letters and do a binary search for each one in the anagram dictionary to see if there is a match. For a 7-letter rack, there is a maximum of 120 unique scrabble-valid combinations of the letters. Performing a binary search is O(log(N)) so this will be very fast.
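The 120 figure is easy to verify: for 7 distinct letters, the subsets of size 2 through 7 sum to C(7,2)+...+C(7,7) = 120. A quick check:

```python
from itertools import combinations

rack = "abcdefg"  # a rack of 7 distinct letters
subsets = [''.join(c) for i in range(2, len(rack) + 1)
           for c in combinations(sorted(rack), i)]
# 120 candidate keys to binary-search in the anagram dictionary
```

With repeated letters in the rack, `combinations` emits duplicates, which is why 120 is a maximum.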

I implemented the algorithm in two parts. The first makes the anagram dictionary and the second is the actual scrabble cheating program.

Anagram dictionary creator code

f = open('/usr/share/dict/words')
d = {}
lets = set('abcdefghijklmnopqrstuvwxyz\n')
for word in f:
  if len(set(word) - lets) == 0 and len(word) > 2 and len(word) < 9:
    word = word.strip()
    key = ''.join(sorted(word))
    if key in d:
      d[key].append(word)
    else:
      d[key] = [word]
f.close()
anadict = [' '.join([key]+value) for key, value in d.iteritems()]
anadict.sort()
f = open('anadict.txt','w')
f.write('\n'.join(anadict))
f.close()

Scrabble cheating code

from bisect import bisect_left
from itertools import combinations
from time import time

def loadvars():
  f = open('anadict.txt','r')
  anadict = f.read().split('\n')
  f.close()
  return anadict

scores = {"a": 1, "c": 3, "b": 3, "e": 1, "d": 2, "g": 2, 
         "f": 4, "i": 1, "h": 4, "k": 5, "j": 8, "m": 3, 
         "l": 1, "o": 1, "n": 1, "q": 10, "p": 3, "s": 1, 
         "r": 1, "u": 1, "t": 1, "w": 4, "v": 4, "y": 4, 
         "x": 8, "z": 10}

def score_word(word):
  return sum([scores[c] for c in word])

def findwords(rack, anadict):
  rack = ''.join(sorted(rack))
  foundwords = []
  for i in xrange(2,len(rack)+1):
    for comb in combinations(rack,i):
      ana = ''.join(comb)
      j = bisect_left(anadict, ana)
      if j == len(anadict):
        continue
      words = anadict[j].split()
      if words[0] == ana:
        foundwords.extend(words[1:])
  return foundwords

if __name__ == "__main__":
  import sys
  if len(sys.argv) == 2:
    rack = sys.argv[1].strip()
  else:
    print """Usage: python cheat_at_scrabble.py <yourrack>"""
    exit()
  t = time()
  anadict = loadvars()
  print "Dictionary loading time:",(time()-t)
  t = time()
  foundwords = set(findwords(rack, anadict))
  scored = [(score_word(word), word) for word in foundwords]
  scored.sort()
  for score, word in scored:
    print "%d\t%s" % (score,word)
  print "Time elapsed:", (time()-t)

The anagram dictionary creator takes about half a second on my machine. Once the dictionary is created, running the scrabble cheating program is about 15x faster than the OP's code and 5x faster than the OP's code after my aforementioned changes. Also, the start-up time of loading the dictionary is much larger than the time spent actually searching for words from a rack, so this is a much better approach when doing multiple searches at once.

Justin Peel
  • You could also make an actual dictionary structure in python, and pickle it. That way you only have to load the pickle, and then lookups are essentially O(1). – Derek Halden Mar 17 '15 at 10:37
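Following up on that comment, here is a sketch of the pickled-dict variant: it builds the same sorted-letters → anagrams mapping, but stores it as a Python dict serialized with `pickle`, so lookups are plain O(1) dict access.

```python
import pickle
from collections import defaultdict

def build_anagram_dict(words):
    # group words by their sorted-letter key, e.g. "aekst" -> anagrams
    d = defaultdict(list)
    for word in words:
        d[''.join(sorted(word))].append(word)
    return dict(d)

anadict = build_anagram_dict(["takes", "skate", "dog"])
blob = pickle.dumps(anadict)     # in practice, write this to a file once...
restored = pickle.loads(blob)    # ...and load it on start-up
# restored["aekst"] -> ["takes", "skate"]
```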
2

You can use the fact that the /usr/share/dict/words dictionary is sorted to skip a lot of words in the dictionary without considering them at all.

For instance, suppose a dictionary word starts with "A" and you don't have "A" in the rack. You can do a binary search on the word list for the first word that starts with "B" and skip all the words in between. This will make a big difference in most cases - you'll skip maybe half the words.
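A sketch of that skip using `bisect` on a tiny sorted list (the real version would use the full word list; note this only prefilters by first letter, the full rack check still runs afterwards):

```python
from bisect import bisect_left

words = ["ant", "bat", "cat", "cow", "dog", "eel"]  # must be sorted
rack = set("acd")

i = 0
kept = []
while i < len(words):
    first = words[i][0]
    if first in rack:
        kept.append(words[i])
        i += 1
    else:
        # binary-search ahead to the first word starting with the next letter
        i = bisect_left(words, chr(ord(first) + 1), i)
# kept -> ["ant", "cat", "cow", "dog"]  (all "b" and "e" words skipped)
```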

Matt
  • Nice idea. The OP could go through the dictionary once, and store a pointer to the first word of each letter. The OP would skip at least 19 of the 26 letters. – Gilbert Le Blanc Mar 30 '11 at 14:01
  • You might do even better by considering N letters at a time - that is: "if the first N letters of the word are not in the rack, skip to the first word in the dictionary which begins with a different N letter combination". I would guess the optimum is around N=2 or N=3... – Matt Mar 30 '11 at 14:04
2
import trie


def walk_trie(trie_node, rack, path=""):
    if trie_node.value is None:
        yield path
    for i in xrange(len(rack)):
        sub_rack = rack[:i] + rack[i+1:]
        if trie_node.nodes.has_key(rack[i]):
            for word in walk_trie(trie_node.nodes[rack[i]], sub_rack, path+rack[i]):
                yield word


if __name__ == "__main__":
    print "Generating trie... "

    # You might choose to skip words starting with a capital
    # rather than lower-casing and searching everything. Capitalised
    # words are probably proper nouns, which aren't allowed in Scrabble

    # I've skipped words shorter than 3 characters.
    all_words = ((line.strip().lower(), None) for line in open("/usr/share/dict/words") if len(line.strip()) >= 3)
    word_trie = trie.Trie(mapping=all_words)
    print "Walking Trie... "
    print list(walk_trie(word_trie.root, "abcdefg"))

Generating the trie takes a little while, but once it's generated, getting the list of words should be much faster than looping over a list.

If anyone knows a way to serialise a trie that would be a great addition.
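One option: if the trie nodes are plain dicts (used here as a stand-in for the `trie` module's node structure, which I haven't checked), `pickle` serialises them directly:

```python
import pickle

def build_trie(words):
    # minimal dict-of-dicts trie; None key marks the end of a word
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root

word_trie = build_trie(["cat", "car", "dog"])
blob = pickle.dumps(word_trie)   # write to disk with open(path, "wb")
restored = pickle.loads(blob)    # load on start-up instead of rebuilding
```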

Just to demonstrate that generating the trie is what takes the time...

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    98333    5.344    0.000    8.694    0.000 trie.py:87(__setitem__)
   832722    1.849    0.000    1.849    0.000 trie.py:10(__init__)
   832721    1.501    0.000    1.501    0.000 {method 'setdefault' of 'dict' objects}
    98334    1.005    0.000    1.730    0.000 scrabble.py:16(<genexpr>)
        1    0.491    0.491   10.915   10.915 trie.py:82(extend)
   196902    0.366    0.000    0.366    0.000 {method 'strip' of 'str' objects}
    98333    0.183    0.000    0.183    0.000 {method 'lower' of 'str' objects}
    98707    0.177    0.000    0.177    0.000 {len}
   285/33    0.003    0.000    0.004    0.000 scrabble.py:4(walk_trie)
      545    0.001    0.000    0.001    0.000 {method 'has_key' of 'dict' objects}
        1    0.001    0.001   10.921   10.921 {execfile}
        1    0.001    0.001   10.920   10.920 scrabble.py:1(<module>)
        1    0.000    0.000    0.000    0.000 trie.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 trie.py:5(Node)
        1    0.000    0.000   10.915   10.915 trie.py:72(__init__)
        1    0.000    0.000    0.000    0.000 trie.py:33(Trie)
        1    0.000    0.000   10.921   10.921 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 trie.py:1(NeedMore)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Stephen Paulger
1

you can convert more lists to generators:

all( [word_count[letter] <= rack_count[letter] for letter in word] )  
...
sum([score[c] for c in word])

to

all( word_count[letter] <= rack_count[letter] for letter in word ) 
...
sum( score[c] for c in word )

In the loop, instead of creating the rack set on every iteration, you can create it in advance, and it can be a frozenset.

rack_set = frozenset(rack)
scored =  ((score_word(word), word) for word in words if set(word).issubset(rack_set) and len(word) > 1 and spellable(word, rack))

The same can be done with the rack_count dictionary. It doesn't need to be created on every iteration.

rack_count  = count_letters(rack)
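Combined, a sketch of the loop with both hoists applied (reusing the question's `count_letters`, slightly condensed, on a toy word list):

```python
def count_letters(word):
    count = {}
    for letter in word:
        count[letter] = count.get(letter, 0) + 1
    return count

def spellable(word_count, rack_count):
    # every letter of the word must be available often enough in the rack
    return all(word_count[letter] <= rack_count.get(letter, 0)
               for letter in word_count)

rack = "letters"
rack_set = frozenset(rack)          # built once, outside the loop
rack_count = count_letters(rack)    # built once, outside the loop

words = ["letter", "set", "zebra"]
found = [w for w in words
         if set(w).issubset(rack_set)
         and spellable(count_letters(w), rack_count)]
# found -> ["letter", "set"]  ("zebra" fails the rack_set prefilter)
```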
Ferran
0

Organize your data better. Instead of reading through a flat dictionary and comparing every word, you could pre-build a tree structure keyed on letter-count vectors, and save that to a file.
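A sketch of the letter-count-vector idea: key each word by a 26-long count tuple so that anagrams collide (using `collections.Counter`; the tree layer on top is left out):

```python
from collections import Counter, defaultdict
import string

def letter_vector(word):
    # 26-long tuple of per-letter counts, e.g. "cab" -> (1, 1, 1, 0, ...)
    counts = Counter(word)
    return tuple(counts.get(ch, 0) for ch in string.ascii_lowercase)

index = defaultdict(list)
for w in ["takes", "skate", "dog"]:
    index[letter_vector(w)].append(w)
# anagrams share a vector: index[letter_vector("skate")] -> ["takes", "skate"]
```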

salezica