
I'm trying to create a program to find the optimal solution for Scrabble, and I need help with generating all of the word combinations from a set of letters. (To clarify: I do not need help validating whether the words are in a dictionary. I just want to generate all of the different combinations.) In Scrabble there is a blank square/wild card which can be any of the valid letters; in my Python code it is represented by '_'. Since the wild card's value is zero, it is important to keep track of which letters originate from a wild card, so all wild-card letters in my code are uppercase. It is also important that the function does not return any word containing an uppercase letter if the same word can also be written in lowercase with the same set of letters, since I only want to keep the highest-scoring word combinations. Also keep in mind that it is possible to get multiple wild cards in Scrabble (though unlikely).
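For example (just an illustration of the duplicate rule with a hypothetical candidate set, not my actual code):

```python
# Illustration of the duplicate rule: uppercase letters mark wild-card
# substitutions, and a wild-card word is dropped when the same word can
# be spelled in lowercase (i.e. without spending the wild card).
candidates = {'a', 'A', 'ab', 'aB', 'aC'}  # hypothetical words from a rack like 'ab_'
kept = {w for w in candidates if w.islower() or w.lower() not in candidates}
# 'A' and 'aB' are dropped because 'a' and 'ab' exist; 'aC' has no lowercase twin
print(kept)
```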

This is my fastest version of the code so far, but in a worst-case scenario with 5 letters and 2 wild cards it still takes over a minute to run. Most of this time is spent removing the uppercase duplicates, so I have focused my optimization effort on that part. So far I have added numba, binary search (to speed up checking whether the lowercase word already exists) and list comprehensions, but I still only got minor improvements at best. I think the only way to speed up my code further is to use another algorithm.

import numpy as np
import itertools, numba, time

def find_all_word_combinations(letters, wild_card='_', valid_letters='abcdefghijklmnopqrstuvwxyzæøå'):
    valid_letters = valid_letters.upper()
    return remove_uppercase_duplicates([
        ''.join(permutation) for replaced_words in replace_wild_card_with_letters(letters, wild_card, valid_letters)
            for length in range(len(replaced_words))
                for combination in itertools.combinations(replaced_words, length + 1)
                    for permutation in itertools.permutations(combination)])

def replace_wild_card_with_letters(word, wild_card='_', valid_letters='abcdefghijklmnopqrstuvwxyzæøå'):
    if wild_card not in word:
        return [word]
    valid_letters = valid_letters.upper()
    where = [i for i, x in enumerate(word) if wild_card == x]
    blanks_filled = [''.join(perm) for perm in itertools.combinations_with_replacement(valid_letters, word.count(wild_card))]
    result = []
    for letters in blanks_filled:
        new_word = list(word)
        for index, letter in enumerate(letters):
            new_word[where[index]] = letter
        result.append(''.join(new_word))
    return result

@numba.njit(fastmath=True)
def binary_search_check(arr:list, item:str)->bool:
    left = 0
    right = len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == item:
            return True
        elif arr[mid] > item:
            right = mid - 1
        else:
            left = mid + 1
    return False

@numba.njit(fastmath=True)
def remove_uppercase_duplicates(words):
    words = np.sort(words)
    return [word for word in words if not binary_search_check(words, word.lower())]

start = time.perf_counter()
print(len(find_all_word_combinations('abcde__')))
end = time.perf_counter()
print(f'Finished in {end-start} seconds.')

Edit: The code I first provided is wrong; the remove_uppercase_duplicates function does not work properly. The function is supposed to delete all duplicates and remove all uppercase words that have a lowercase duplicate. Since someone has already posted a better answer, I'm not going to fix the original code (I am also too lazy ¯\_(ツ)_/¯). Feel free to fix my error and post it. I also fixed some grammar.

2 Answers


A couple of strategies might help here. Two goals stand out right away:

  1. You really, really want to avoid sorting the result, which is a huge overhead, especially when it is followed by a binary search executed many times.

  2. You really (just one "really") want to work with sets rather than lists here, to avoid duplicates and make membership checking near-instantaneous.
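As a sketch of goal 2, the whole duplicate-removal step from the question collapses to one set comprehension (assuming `words` is any iterable of candidate strings):

```python
def remove_uppercase_duplicates(words):
    # O(1) set membership replaces sort + repeated binary search:
    # keep a word if it is all lowercase, or if no lowercase spelling
    # of it exists among the candidates
    word_set = set(words)
    return {w for w in word_set if w.islower() or w.lower() not in word_set}
```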

So here is a solution that does that. Note: there is probably a little bit left to gain if you really dig into the wildcard-replacement piece, but I'm not sure it is worth the effort; I didn't tweak your replacement function much. This computes all words in less than 2 seconds.

Also note: I didn't QA this much; you should probably do some set comparisons against your current method to ensure that the results match expectations.

Edit: two corrections. First, the original post had "permutations within the permutation", which was unnecessary and has been removed. Second, I neglected the fact that I reduced the size of letters by the number of wildcards, which affects the length limit. Both are fixed; the new total is ~2.3 sec.

Code:

import itertools, time

def find_all_word_combinations(letters, wild_card='_', valid_letters='abcdefghijklmnopqrstuvwxyzæøå'):
    print(f'analyzing: {letters}')
    non_wildcard_words = set()
    wildcard_words = set()

    # separate the base letters and count wild cards
    wildcard_count = 0
    while wild_card in letters:
        wildcard_count += 1
        letters.remove(wild_card)

    for length in range(0, len(letters) + 1):
        for combo in itertools.combinations(letters, length):
            # make what can be made without wildcards at this length...
            for perm in itertools.permutations(combo):
                non_wildcard_words.add(''.join(perm))
            # make wildcard words from this length+ (there will be some overlap)
            for i in range(1, wildcard_count + 1):  # need to add back in wildcards 1-by-1 if multiple
                # append the wildcards if length is < max
                base = list(combo)  # needs to be a list, not tuple
                base += [wild_card]*i
                if len(base) > len(letters)+wildcard_count:  # too long...
                    continue
                # need to permute the base because the replacement function is position dependent...
                for perm_2 in itertools.permutations(base):
                    # print(base, perm_2)
                    wildcard_words.update(replace_wild_card_with_letters(list(perm_2)))

    return non_wildcard_words, wildcard_words


def replace_wild_card_with_letters(word, wild_card='_', valid_letters='abcdefghijklmnopqrstuvwxyzæøå'):
    if wild_card not in word:
        return {word,}
    #valid_letters = valid_letters.upper()
    where = [i for i, x in enumerate(word) if wild_card == x]
    blanks_filled = [''.join(perm) for perm in itertools.combinations_with_replacement(valid_letters, word.count(wild_card))]
    result = set()
    for letters in blanks_filled:
        new_word = word
        for index, letter in enumerate(letters):
            new_word[where[index]] = letter
        result.add(''.join(new_word))
    return result

start = time.perf_counter()
words, wildcard_words = find_all_word_combinations(list('abcde__'))
# print(words)
# print(wildcard_words)
print(f'created {len(words)} base words')
print(f'created {len(wildcard_words)} wildcard words')
print(f'intersection is {len(words & wildcard_words)}')
wildcard_words -= words  # remove the non-wildcard words from the set of wildcard words
print(f'reduced wildcard words to {len(wildcard_words)} by removing dupes')

end = time.perf_counter()
print(f'Finished in {end-start} seconds.')

Output (edited after code update):

analyzing: ['a', 'b', 'c', 'd', 'e', '_', '_']
created 326 base words
created 1932459 wildcard words
intersection is 325
reduced wildcard words to 1932134 by removing dupes
Finished in 2.3314543350024906 seconds.
AirSquid
  • There's another hidden devil in here. `combinations_with_replacement` is just that... combinations *not* permutations, so not all possible pairs of letters are considered in the wildcard generation in the scheme above... may have a chance to come back to this later – AirSquid Jun 24 '23 at 20:27
  • Thanks for the code; however, there is one problem with it: it only returns letters in lowercase... so it is impossible to figure out which letters originate from a wild card, and therefore impossible to correctly score the word, since wild cards have a value of zero. I tried modifying your code by re-adding the "valid_letters = valid_letters.upper()" line in replace_wild_card_with_letters, but this broke the sets, so now you get duplicate words. Do you think you could modify your program while keeping this in mind? – Maggi_Master Jun 25 '23 at 10:58

I would suggest indexing the word list by position and letter and using set intersections to find the list of words that match a series of tiles:

Setting it up:

with open("/usr/share/dict/words",'r') as f:
    words = f.read().split("\n")

print(len(words)) # my word list contains 235,977 words

# build a dictionary to get set of word indexes 
# for each combination of position+letter 
# same for each word length

# include backward words if allowed
words += [w[::-1] for w in words] 

posLetter  = dict()
wordLengths = dict()
for i,w in enumerate(words):
    for p,c in enumerate(w):
        posLetter.setdefault((p,c),set()).add(i)  # position+letter
    wordLengths.setdefault(len(w),set()).add(i)   # word length

Here is a function that uses set intersections to find words of a given length that match existing letters in a list of tiles (ignoring wildcards and empty cells which you can represent with any characters other than letters)

def findWords(tiles):
    wordIndexes = wordLengths.get(len(tiles),set()).copy() # word length
    for i,c in enumerate(tiles):
        if not c.isalpha(): continue
        wordIndexes &= posLetter.get((i,c),set()) # position/letter intersect
        if not wordIndexes: break
    return [words[i] for i in wordIndexes]

output:

tests = ["-l-*-ant","ox------","spa----"]

from timeit import timeit
for tiles in tests:
    print(tiles,":",timeit(lambda:findWords(tiles),number=1))
    print(findWords(tiles))

-l-*-ant : 0.004177336999999781
['aliquant', 'flippant', 'flagrant', 'elephant', 'olefiant', 'oliphant', 'fluitant', 'alterant', 'claimant', 'pleasant', 'albicant', 'slampant']
ox------ : 0.002012269999999816
['oxyurous', 'oxammite', 'oxanilic', 'oxycrate', 'oxydiact', 'oxyether', 'oxyethyl', 'oxyfatty', 'oxygenic', 'oxharrow', 'oxidable', 'oxidator', 'oxymoron', 'oxidizer', 'oxyphile', 'oxyphyte', 'oxpecker', 'oxtongue', 'oxyrhine', 'oxyamine', 'oxalamid', 'oxyaphia', 'oxyaster', 'oxystome', 'oxalemia', 'oxytocia', 'oxytocic', 'oxytocin', 'oxaluria', 'oxaluric']
spa---- : 0.0030339960000000943
['spaniel', 'spaning', 'spartle', 'sparver', 'spanker', 'spaherp', 'spasmed', 'spasmic', 'spannel', 'spanner', 'spanule', 'sparada', 'spastic', 'spacing', 'sparely', 'spathal', 'spathed', 'spathic', 'spahrep', 'sparger', 'spadger', 'spatial', 'sparing', 'sparked', 'spading', 'sparker', 'spadone', 'spatted', 'spatter', 'sparkle', 'spaedom', 'spaeman', 'spattle', 'spatula', 'sparkly', 'spatule', 'spairge', 'sparoid', 'spavied', 'spaviet', 'sparred', 'sparrer', 'spalder', 'sparrow', 'spawner', 'spaller', 'spayard', 'spaying', 'spancel', 'spandle', 'spanemy', 'spangle', 'spangly']

Note1: backward words are returned in their backward order to facilitate tile matching with increasing indexes (e.g. 'spahrep' for 'perhaps').

Note2: A special case should be made for tile lists that don't contain any letters, because those would return all words of the given length. Indexing by letter frequency and filtering based on available letters may be needed to keep the set sizes within manageable limits.
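The availability filter mentioned in Note2 can be sketched with a Counter subtraction (a hypothetical helper, not part of the index above):

```python
from collections import Counter

def fits_rack(word, rack):
    # True when every letter of `word` can be drawn from the rack;
    # Counter subtraction keeps only the letters the rack is missing,
    # so an empty result means the word is playable from the rack alone
    return not (Counter(word) - Counter(rack))
```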

With a performance of around 2-4 milliseconds for a specific list of tiles (1.2-1.6 ms if backward words are not allowed), the function can be called multiple times to check various segments and positions.

From this, a function to find playable words for a complete row of 15 tiles can be written:

from collections import Counter
def playableWords(tiles,available):
    avCount  = Counter(available)
    result   = []
    empty    = "-"
    wildcard = "*"
    prev     = empty
    for s,tile0 in enumerate(tiles[:-1]):
        if prev == empty:
           hasLetter = False 
           for e,tile1 in enumerate(tiles[s+1:],s+1):
               if tile1 != empty:
                   hasLetter = True
               if not hasLetter: continue
               if e<len(tiles)-1 and tiles[e+1] != empty:
                   continue
               if empty not in tiles[s:e+1]: continue
               for word in findWords(tiles[s:e+1]):
                   needed = [c for c,t in zip(word,tiles[s:e+1]) if t==empty]
                   if all(avCount[c]>=n for c,n in Counter(needed).items()):
                       result += [(word,tiles[s:e+1],"".join(needed),s)]
        prev = tile0
    return result

output:

for word, tiles, needed, start in playableWords("spa-----l---ant","rroweeph"):
    print(word, "tiles:",tiles,"from",start,"using:",needed)

spar tiles: spa- from 0 using: r
scape tiles: spa-- from 0 using: pe
spare tiles: spa-- from 0 using: re
snape tiles: spa-- from 0 using: pe
shape tiles: spa-- from 0 using: pe
sparer tiles: spa--- from 0 using: rer
sparrow tiles: spa---- from 0 using: rrow
wheel tiles: ----l from 4 using: whee
prowl tiles: ----l from 4 using: prow
whorl tiles: ----l from 4 using: whor
repel tiles: ----l from 4 using: repe
rowel tiles: ----l from 4 using: rowe
hewel tiles: ----l from 4 using: hewe
howel tiles: ----l from 4 using: howe
prowler tiles: ----l-- from 4 using: prower
herl tiles: ---l from 5 using: her
heel tiles: ---l from 5 using: hee
reel tiles: ---l from 5 using: ree
howl tiles: ---l from 5 using: how
weel tiles: ---l from 5 using: wee
peel tiles: ---l from 5 using: pee
whelp tiles: ---l- from 5 using: whep
whole tiles: ---l- from 5 using: whoe
perle tiles: ---l- from 5 using: pere
replow tiles: ---l-- from 5 using: repow
howler tiles: ---l-- from 5 using: hower
owl tiles: --l from 6 using: ow
pol tiles: --l from 6 using: po
eel tiles: --l from 6 using: ee
rel tiles: --l from 6 using: re
pole tiles: --l- from 6 using: poe
hole tiles: --l- from 6 using: hoe
role tiles: --l- from 6 using: roe
help tiles: --l- from 6 using: hep
poler tiles: --l-- from 6 using: poer
holer tiles: --l-- from 6 using: hoer
owler tiles: --l-- from 6 using: ower
plew tiles: -l-- from 7 using: pew
plow tiles: -l-- from 7 using: pow
elephant tiles: -l---ant from 7 using: eeph
orant tiles: --ant from 10 using: or
pant tiles: -ant from 11 using: p
hant tiles: -ant from 11 using: h
want tiles: -ant from 11 using: w
rant tiles: -ant from 11 using: r

This runs in 0.113 seconds (0.045 if backward words are not allowed), which would allow a whole 15x15 board to be analyzed in less than 3.5 seconds (or 1.4). With optimizations to store previous results and only re-analyze segments that are changed by the previous move, it should be possible to bring the response time under a second.
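Storing previous results could be as simple as memoizing the segment lookup (a sketch with a stand-in lookup function; the real findWords above would need to take a hashable tile string rather than a list):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def find_words_cached(tiles):
    # segments unchanged by the previous move hit the cache on re-analysis
    return tuple(find_words(tiles))

def find_words(tiles):
    # stand-in for the real index-based lookup, for illustration only
    return [w for w in ('spar', 'sparrow') if len(w) == len(tiles)]
```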

Alain T.