
I'm trying to create a function best_tiles which takes the number of tiles in your hand and returns the set of tiles that allows you to produce the greatest number of unique English-valid words, assuming that you can only use each tile once.

For example, with the tiles (A, B, C) in your hand you can produce the words CAB, BAC, AB, and BA (all of these are English words), so you can spell 4 unique words with that set. With (B, A, A), you can spell 5 words: ABA, BAA, AA, AB, and BA. The goal is to find the set of letters which allows you to spell the greatest number of English-valid words (without replacement).

So if 5 is the maximum number of words that can be spelled with any combination of letters for N = 3, running best_tiles(n=3) would return B, A, A.
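For concreteness, here is a brute-force count of the words spellable from a hand (a minimal sketch; count_spellable is a name I made up, and valid_words is assumed to be a set of uppercase words loaded from a word list):

import itertools

def count_spellable(tiles, valid_words):
    # Try every permutation of every subset of the tiles and keep
    # the distinct arrangements that are real words.
    spellable = set()
    for r in range(1, len(tiles) + 1):
        for perm in itertools.permutations(tiles, r):
            word = "".join(perm)
            if word in valid_words:
                spellable.add(word)
    return len(spellable)

# count_spellable(("B", "A", "A"), {"ABA", "BAA", "AA", "AB", "BA"}) == 5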

I'm wondering how to implement this efficiently. My current approach doesn't scale well at all with the number of letters.

I read in a wordlist. In this case, I'm using enable.txt here: https://www.wordgamedictionary.com/enable/

import os
path = "enable.txt"
words = []
with open(path, encoding='utf8') as f:
    for values in f:
        words.append(list(values.strip().upper()))

I create a function word_in_tiles (h/t smac89) which returns whether it is possible to construct a word given a tile set:

import collections

def word_in_tiles(word, tiles):
  tiles_counter = collections.Counter(tiles)
  return all(tiles_counter.get(ch, 0) >= cnt
             for ch, cnt in collections.Counter(word).items())
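For example, the tiles must supply at least as many copies of each letter as the word uses (a couple of quick checks of the function above):

word_in_tiles("BAA", "BAA")  # True
word_in_tiles("AB", "BAA")   # True -- only two of the three tiles are used
word_in_tiles("BAA", "ABC")  # False -- only one 'A' is available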

I then create a function get_all_words which produces a list of all the possible words one can spell from a word list and a tile set.

def get_all_words(tile_set, words):
    # words is a word list
    return [i for i in words if word_in_tiles(i, tile_set)]
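For instance, with a hand-picked miniature word list (recall that the words read from the file are stored as lists of characters):

sample_words = [list("AB"), list("BA"), list("AA"), list("CAB")]
get_all_words(("B", "A", "A"), sample_words)
# -> [['A', 'B'], ['B', 'A'], ['A', 'A']]  -- 'CAB' needs a 'C' we don't have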

The extremely naive approach for identifying which tileset is the "best" for three letters is the following:

I first create a list of every possible combination for a given length. So for length 3, I'd do:

import string
import itertools 

letters = string.ascii_lowercase
all_three_letter_combinations = list(itertools.combinations_with_replacement(letters, 3))

# Create a list of only the words that are three letters or fewer
three_letter = [i for i in words if len(i) <= 3]

sequence_dict = dict()
for i in all_three_letter_combinations:
    string_test = "".join(i).upper()
    sequence_dict[i] = get_all_words(string_test, three_letter)

Then I remove the entries with no matches and find the tile set with the most results:

res = {k: v for k, v in sequence_dict.items() if len(v) >= 1}

def GetMaxLength(res):
    return max((len(v), v, k) for k, v in res.items())[1:]

Calling GetMaxLength(res), you get that, for three letters, the tile set that produces the most English-valid words is (T, A, E), which can produce the following words: ['AE', 'AT', 'ATE', 'EAT', 'ET', 'ETA', 'TA', 'TAE', 'TEA']
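For scale, the number of distinct hands explodes combinatorially, which is why this enumeration cannot reach N = 15; a quick check with the standard library:

from math import comb

# Multisets of 15 tiles drawn from 26 letters:
print(comb(26 + 15 - 1, 15))  # 40225345056 -- far too many hands to try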

I'd like to be able to scale this up to as big as N = 15. What is the best procedure for doing this?

Parseltongue
  • can't you just do something like `{''.join(comb) for i in range(len(word), 0, -1) for comb in combinations(word, i)}`? – acushner Oct 05 '20 at 18:21
  • this might be terrible, but might work?! – acushner Oct 05 '20 at 18:21
  • i did [this](https://github.com/sweettuse/anagrammer) if that helps – acushner Oct 05 '20 at 18:22
  • just realized that it should probably be `perms`, but take a look at the link – acushner Oct 05 '20 at 18:28
  • There are two problems with the approach in the question. 1) You'll be checking combinations that are clearly ridiculous, e.g. "ZZZ". In fact, any combination that only contains consonants can be immediately dismissed, and that would eliminate the majority of the combinations. 2) The number of combinations grows exponentially, but the number of words in the word list is a fixed, small constant. With 5 letters there are over 11 million combinations, but the number of words in the word list is much smaller than that. So analyzing the word list should be faster than trying letter combinations. – user3386109 Oct 05 '20 at 20:13
  • For the first example you gave, why is `BAC`, or `BCA` not part of the output? – smac89 Oct 05 '20 at 20:18
  • That's what I'm saying. `BAC` is not part of the output you showed in your example. So I think you are filtering the words based on words from a file? – smac89 Oct 05 '20 at 20:22
  • The question is still not making sense to me. That was just an example I gave, but what about the words `BC`, or `CA` for that first example; What criteria makes them **NOT an english word**? – smac89 Oct 05 '20 at 20:27
  • Ok I think I'm starting to understand. So it seems you are using a scrabble word dictionary to check if a word is valid or not? – smac89 Oct 05 '20 at 20:31
  • Unless my code is bugged, the solution to n = 15 is acdegilmnoprstu. – David Eisenstat Oct 09 '20 at 01:06

7 Answers


I think this is good enough!

Here is a log of my code running under PyPy:

0:00:00.000232
E
0:00:00.001251
ER
0:00:00.048733
EAT
0:00:00.208744
ESAT
0:00:00.087425
ESATL
0:00:00.132049
ESARTP
0:00:00.380296
ESARTOP
0:00:01.409129
ESIARTLP
0:00:03.433526
ESIARNTLP
0:00:10.391252
ESIARNTOLP
0:00:25.651012
ESIARNTOLDP
0:00:56.642405
ESIARNTOLCDP
0:01:57.257293
ESIARNTOLCDUP
0:03:55.933906
ESIARNTOLCDUPM
0:07:17.146036
ESIARNTOLCDUPMG
0:10:14.844347
ESIARNTOLCDUPMGH
0:13:34.722600
ESIARNTOLCDEUPMGH
0:18:14.215019
ESIARNTOLCDEUPMGSH
0:22:47.129284
ESIARNTOLCDEUPMGSHB
0:27:56.859511
ESIARNTOLCDEUPMGSHBYK
0:46:20.448502
ESIARNTOLCDEUPMGSHBYAK
0:57:15.213635
ESIARNTOLCDEUPMGSHIBYAT
1:09:55.530180
ESIARNTOLCDEUPMGSHIBYATF
1:18:35.209599
ESIARNTOLCDEUPMGSHIBYATRF
1:21:54.095119
ESIARNTOLCDEUPMGSHIBYATRFV
1:20:16.978411
ESIARNTOLCDEUPMGSHIBYAOTRFV
1:14:24.253660
ESIARNTOLCDEUPMGSHIBYAONTRFV
1:00:37.405571

The key improvements are these.

  1. I distinguish not only between letters, but how many times the letter has been seen. Therefore at every (letter, count) key I can either accept it or move on. That was an idea I got while commenting on David Eisenstat's solution; a sketch of the expansion is shown after this list.
  2. From him I also got the idea that pruning out branches that can't lead to an answer controls the growth of the problem surprisingly well.
  3. The very first solution that I look at is simply all the top letters. This starts as a pretty good solution, so despite the search being depth first, we prune pretty well.
  4. I am careful to consolidate "exhausted tries" into a single record. This reduces how much data we have to throw around.
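For example, under point 1 a word expands into (letter, occurrences-so-far) keys, so a first 'A' and a second 'A' are distinct symbols. A minimal sketch of that expansion (word_to_keys is an illustrative name, not part of the code below):

def word_to_keys(word):
    # Map each letter to (letter, how many times it has occurred so far).
    seen = {}
    keys = []
    for letter in word:
        keys.append((letter, seen.get(letter, 0)))
        seen[letter] = seen.get(letter, 0) + 1
    return keys

# word_to_keys("BAA") -> [('B', 0), ('A', 0), ('A', 1)]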

And here is the code.

import os
import datetime
path = "enable.txt"
words = []
with open(path) as f:
    for values in f:
        words.append(values.strip().upper())

key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# from the next letter farther in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This allows us to use the fact that a second 'e' is
# unlikely, so we can deal with that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.min_pos = -1
        self.max_pos = -1
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
                if trie.max_pos < pos:
                    trie.max_pos = pos
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words.append(word)


base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

def best_solution (size):
    def solve (subset, pos, best, partial):
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found
        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + len(include_trie.words),
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )
            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    count, subset = solve([], 0, 0, [(len(base_trie.words), base_trie.count_words, base_trie)])
    return ''.join([KEYS[x][0] for x in subset])

for i in range(20):
    start = datetime.datetime.now()
    print(best_solution(i))
    print(datetime.datetime.now() - start)
btilly
  • This is truly amazing. Thanks so much for the hard work. – Parseltongue Oct 08 '20 at 23:32
  • @Parseltongue It was a fun problem. I'm actually astounded at how well this is running. If it was rewritten in a fast language, finding optimal solutions of size 30 is probably doable! – btilly Oct 09 '20 at 01:15
  • I find it very interesting that the best solutions have no repeat letters. I would have guessed at N=15 we'd get at least one repeat! – Parseltongue Oct 09 '20 at 02:52
  • @Parseltongue We are now into repeats. It is interesting that it does not always simply add letters. For example from 22 to 23 we lose a `K` and add a second `I` and a `T`. – btilly Oct 09 '20 at 16:56
  • Interestingly the slowest seems to have been 25. We are now speeding up. If left to run for a week this program could probably find all answers. – btilly Oct 09 '20 at 22:50
  • Hmm, very interesting that it sped up after 25... I wonder what could cause that. – Parseltongue Oct 09 '20 at 23:51
  • @Parseltongue The list of open tries probably got smaller as we completed words. This gave me an idea for how to speed it up even more, but that idea failed to be a speedup. – btilly Oct 10 '20 at 19:20

This code can optimize n = 15 in a couple of minutes using PyPy on my laptop, finding

10701 acdegilmnoprstu.

The idea is to do branch and bound, where at each node some letters are forced to be included, and others are excluded. We derive an upper bound on the quality of each node by finding an order-preserving map f (preserving the partial order of multiset inclusion) from multisets of letters to a smaller partially ordered space, then counting the number of words we can get where f(word) is included in the best f(tiles). On the smaller space we can brute force the problem using a fast convolution method (reminiscent of FFT).

To find a good space, we greedily remove letters one at a time so as to affect as few words as possible, until the upper bound can be brute forced.
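As a toy illustration of that brute-force step, here is the sum-over-subsets transform in isolation, on a hand-made 2-letter alphabet (the same loop as sum_over_subsets in the code below):

# Word counts per exact letter set, indexed by bitmap over {a, b}:
# {} -> 0, {a} -> 1, {b} -> 1, {a, b} -> 1
vector = [0, 1, 1, 1]
dimension = 2
for i in range(dimension):
    bit = 1 << i
    for bitmap in range(1 << dimension):
        if not (bitmap & bit):
            vector[bitmap | bit] += vector[bitmap]
# Now vector[bitmap] counts the words spellable from ANY subset of bitmap:
assert vector == [0, 1, 1, 3]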

import array
import collections
import functools
import heapq


def count_occurrences_of_letters(raw_word):
    occurs = collections.Counter()
    word = []
    for letter in raw_word:
        word.append(letter + str(occurs[letter]))
        occurs[letter] += 1
    return word


def greedy_censorship_order(words):
    hits = collections.defaultdict(set)
    for index, word in enumerate(words):
        for letter in word:
            hits[letter].add(index)
    order = []
    while hits:
        censored_letter = min(hits.keys(), key=lambda letter: len(hits[letter]))
        order.append(censored_letter)
        for index in hits[censored_letter]:
            for letter in words[index]:
                if letter != censored_letter:
                    hits[letter].remove(index)
        del hits[censored_letter]
    return order


def bitmap_from_word(word, alphabet):
    bitmap = 0
    censored = 0
    for letter in word:
        try:
            bitmap |= 1 << alphabet.index(letter)
        except ValueError:
            censored += 1
    return bitmap, censored


def sum_over_subsets(vector, dimension):
    for i in range(dimension):
        bit = 1 << i
        for bitmap in range(1 << dimension):
            if not (bitmap & bit):
                vector[bitmap | bit] += vector[bitmap]


def count_set_bits(n):
    return bin(n).count("1")


@functools.total_ordering
class Node:
    def __init__(self, subset, n, unfiltered_words):
        self.subset = subset
        self.n = n
        self.words = [word for word in unfiltered_words if len(word) <= n]
        self.upper_bound = sum(not word for word in self.words)
        if n == 0:
            return
        order = greedy_censorship_order(self.words)
        if not order:
            self.pivot = None
            return
        self.pivot = order[-1]
        alphabet = order[-(n + 7) :]
        zeros = [0] * (1 << len(alphabet))
        vectors = [array.array("l", zeros) for i in range(n + 1)]
        for word in self.words:
            bitmap, censored = bitmap_from_word(word, alphabet)
            for i in range(censored, n + 1):
                vectors[i][bitmap] += 1
        for vector in vectors:
            sum_over_subsets(vector, len(alphabet))
        self.upper_bound = max(
            vectors[n - count_set_bits(bitmap)][bitmap]
            for bitmap in range(1 << len(alphabet))
            if count_set_bits(bitmap) <= n
        )

    def children(self):
        if self.pivot is None:
            return
        yield Node(
            self.subset, self.n, [word for word in self.words if self.pivot not in word]
        )
        yield Node(
            self.subset | {self.pivot},
            self.n - 1,
            [
                [letter for letter in word if letter != self.pivot]
                for word in self.words
            ],
        )

    def __eq__(self, other):
        return self.upper_bound == other.upper_bound

    def __ne__(self, other):
        return self.upper_bound != other.upper_bound

    def __lt__(self, other):
        return self.upper_bound > other.upper_bound


def solve(n, words):
    heap = [Node(set(), n, words)]
    while True:
        top = heapq.heappop(heap)
        print(top.upper_bound, "".join(sorted(letter[0] for letter in top.subset)))
        if top.n == 0:
            return
        for child in top.children():
            heapq.heappush(heap, child)


def main():
    with open("enable.txt") as file:
        raw_words = file.read().split()
    words = [count_occurrences_of_letters(word) for word in raw_words]
    solve(15, words)


if __name__ == "__main__":
    main()
David Eisenstat
  • This gives me an idea. Step 1, create a frequency of first letter, second letter, etc. Then replace the letters with indexes into that list. There are only 109 of those. Now instead of starting with all of the letters, start with, say, the top `k+5` on your list. Do your subset trick. And then for each following index, cycle through your subsets, either choosing to add or not, and filter it down again. The hope being that the number of subsets is limited by the fact that the top solution can't include many unpopular choices down the list. – btilly Oct 08 '20 at 16:09
  • It is worth noting that for lengths 2-8, the best answers are all limited to the top 14 on that list. It is also worth noting that a second 'E' isn't in the optimal solution, but does show up ahead of 'U'. And the top 20 symbols include a second 'E', a second 'S' and a second 'I'. – btilly Oct 08 '20 at 16:11
  • Very cool! Using my `get_all_words` function above, I get that the N = 15 solution discovered by you (`acdegilmnoprstu`) and the one discovered by @btilly `ESIARNTOLCDUPMG`, both yield exactly 10755 results. Fantastic work. Thanks a lot for doing this -- I've learned a lot. Is there a way to award two people bounties? – Parseltongue Oct 09 '20 at 02:59
  • On my laptop your code takes 3 minutes, so your code is about 3x as fast as mine. I'm not sure how much of that is more efficient data structures (eg bitmaps instead of lists) and how much is a better algorithm, I suspect a bit of both. Well done! Also thanks for pointing me at PyPy. It really does make a nice difference! – btilly Oct 09 '20 at 04:43
  • Out of curiosity I decided to run yours for 20 to see how time would compare. Yours started off at 22 GB of RAM while mine took only half a GB. In the end yours took the same time as mine. So it looks like mine is scaling better. – btilly Oct 09 '20 at 05:21
  • @btilly The branch and bound part I tossed together in a few minutes without worrying about copying the word list a ton of times. Low hanging fruit there. – David Eisenstat Oct 09 '20 at 12:15
  • @Parseltongue no, but I don't do this for the rep at this point. Thanks for the cool problem! – David Eisenstat Oct 09 '20 at 14:45
  • @Parseltongue acdegilmnoprstu and ESIARNTOLCDUPMG are the same set of letters. – גלעד ברקן Oct 10 '20 at 17:10
  • @Parseltongue my calculation yielded 10701, conforming with the number of valid words for the 15 letter combination David Eisenstat and btilly found. There could possibly be a discrepancy between the lists or in the function, `get_all_words`, that's yielding 10755 for you. – גלעד ברקן Oct 11 '20 at 02:10

Here's a "dumb" sum-over-subsets that accumulates, for each count of 1 to 26, the selection of distinct letters that yields the most words in the file "enable.txt" in under 33 seconds on my laptop. (The 32 seconds is a speedup by David Eisenstat, changing my original code that ran in 6 minutes 45 seconds to an in-place method).

Since btilly and David Eisenstat have already done the difficult work of optimising a search that also includes duplicates, we know that the information here, up to 16 letters, is useful.
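A note on the encoding, using the as_number and get_letters helpers defined below: a word with distinct letters becomes a 26-bit mask, while any word containing a repeated letter maps to 0 and is silently dropped, which is exactly why this answer covers only distinct-letter selections. The in-place DP then visits 26 * 2^26, roughly 1.7 billion, mask/bit pairs, which is why PyPy matters here.

print(get_letters(as_number("ate")))  # 'aet' -- one bit per distinct letter
print(as_number("baa"))               # 0 -- repeated letter, word is dropped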

from collections import defaultdict

def as_number(word):
    word = word.lower()
    n = 0
    for c in word:
        m = ord(c) - 97
        if n & (1 << m):
            return 0
        else:
            n |= 1 << m
    return n

def get_letters(n):
    letters = ""
    i = 0
    while n:
        if n & 1:
            letters += chr(97 + i)
        n >>= 1
        i += 1
    return letters

def f(words, N):
    hash = defaultdict(lambda: 0) #[0] * (1 << N)

    for w in words:
        num = as_number(w)
        if num:
            hash[num] += 1 #= -~hash[num]
  
    dp = [hash.get(mask, 0) for mask in range(1 << N)]

    for i in range(N):
        for mask in range(1 << N):
            if not (mask & (1 << i)):
                dp[mask ^ (1 << i)] += dp[mask]

    result = {}

    for i in range(1, 1 << N):
        k = bin(i).count("1")
        if k in result:
            if result[k]["best"] == dp[i]:
                result[k]["best_letters"].append(get_letters(i))
            elif result[k]["best"] < dp[i]:
                result[k]["best"] = dp[i]
                result[k]["best_letters"] = [get_letters(i)]
        elif dp[i]:
            result[k] = {
                "best": dp[i],
                "best_letters": [get_letters(i)]
            }

    return result

import os
import datetime
path = "enable.txt"
words = []
with open(path) as file:
    for values in file:
        words.append(values.strip())

start = datetime.datetime.now()
print(f(words, 26))
print(datetime.datetime.now() - start)

Output:

// ♥ pypy py.py
{
    2: {
        'best': 2,
        'best_letters': ['ab', 'de', 'ah', 'eh', 'al', 'am', 'em', 'an', 'en', 'do', 'ho', 'mo', 'no', 'er', 'is', 'os', 'at', 'it', 'mu', 'nu', 'ow', 'ay', 'oy']
    },
    3: {
        'best': 9,
        'best_letters': ['aet']
    },
    4: {
        'best': 24,
        'best_letters': ['aest']
    },
    5: {
        'best': 66,
        'best_letters': ['aelst']
    },
    6: {
        'best': 150,
        'best_letters': ['aeprst']
    },
    7: {
        'best': 283,
        'best_letters': ['aeoprst']
    },
    8: {
        'best': 543,
        'best_letters': ['aeilprst']
    },
    9: {
        'best': 945,
        'best_letters': ['aeilnprst']
    },
    10: {
        'best': 1590,
        'best_letters': ['aeilnoprst']
    },
    11: {
        'best': 2557,
        'best_letters': ['adeilnoprst']
    },
    12: {
        'best': 3855,
        'best_letters': ['acdeilnoprst']
    },
    13: {
        'best': 5648,
        'best_letters': ['acdeilnoprstu']
    },
    14: {
        'best': 8001,
        'best_letters': ['acdeilmnoprstu']
    },
    15: {
        'best': 10701,
        'best_letters': ['acdegilmnoprstu']
    },
    16: {
        'best': 14060,
        'best_letters': ['acdeghilmnoprstu']
    },
    17: {
        'best': 17225,
        'best_letters': ['abcdeghilmnoprstu']
    },
    18: {
        'best': 20696,
        'best_letters': ['abcdeghilmnoprstuy']
    },
    19: {
        'best': 23723,
        'best_letters': ['abcdeghiklmnoprstuy']
    },
    20: {
        'best': 26542,
        'best_letters': ['abcdefghiklmnoprstuy']
    },
    21: {
        'best': 29501,
        'best_letters': ['abcdefghiklmnoprstuwy']
    },
    22: {
        'best': 31717,
        'best_letters': ['abcdefghiklmnoprstuvwy']
    },
    23: {
        'best': 32675,
        'best_letters': ['abcdefghiklmnoprstuvwyz']
    },
    24: {
        'best': 33548,
        'best_letters': ['abcdefghiklmnoprstuvwxyz']
    },
    25: {
        'best': 34299,
        'best_letters': ['abcdefghijklmnoprstuvwxyz']
    },
    26: {
        'best': 34816,
        'best_letters': ['abcdefghijklmnopqrstuvwxyz']
    }
}
0:00:32.295888
גלעד ברקן
  • Note that it's possible to run the DP in place, with a significant speedup: ` dp = [hash.get(mask, 0) for mask in range(1 << N)] for i in range(N): for mask in range(1 << N): if not (mask & (1 << i)): dp[mask ^ (1 << i)] += dp[mask] ` – David Eisenstat Oct 11 '20 at 02:07
  • @DavidEisenstat OMG. It's now 32 seconds. Thanks! – גלעד ברקן Oct 11 '20 at 02:33
  • This is very nice. Thanks a lot! If I could reward another bounty, I would. – Parseltongue Oct 11 '20 at 18:22
  • 1
    @Parseltongue oh, no, it's not worth a bounty. The work for this question must include the possibility of duplicates, which this answer does not. But the speed is kind of cool. – גלעד ברקן Oct 11 '20 at 19:03

This is my previous answer with another idea added, borrowed from David Eisenstat: start with a data structure that squashes out differences caused by common letters. The speedup was quite good.

Here is a full log of the code running under PyPy. Once you've gone pretty far, it goes very, very fast because the data structure is very small. The slowest case was 23 letters.

0:00:00.000226

0:00:00.000095
ER
0:00:00.000514
EAT
0:00:00.014545
ESAT
0:00:00.069521
ESATL
0:00:00.063586
ESARTP
0:00:00.100273
ESARTOP
0:00:00.309665
ESIARTLP
0:00:01.171279
ESIARNTLP
0:00:02.435968
ESIARNTOLP
0:00:05.049897
ESIARNTOLDP
0:00:11.358067
ESIARNTOLCDP
0:00:23.378628
ESIARNTOLCDUP
0:00:38.672561
ESIARNTOLCDUPM
0:00:57.530691
ESIARNTOLCDUPMG
0:01:23.665592
ESIARNTOLCDUPMGH
0:01:43.571878
ESIARNTOLCDEUPMGH
0:02:02.831546
ESIARNTOLCDEUPMGSH
0:02:13.175045
ESIARNTOLCDEUPMGSHB
0:02:15.644423
ESIARNTOLCDEUPMGSHBY
0:02:25.681542
ESIARNTOLCDEUPMGSHBYK
0:02:38.858906
ESIARNTOLCDEUPMGSHBYAK
0:02:44.703865
ESIARNTOLCDEUPMGSHIBYAT
0:02:52.872105
ESIARNTOLCDEUPMGSHIBYATF
0:02:38.724424
ESIARNTOLCDEUPMGSHIBYATRF
0:02:19.892021
ESIARNTOLCDEUPMGSHIBYATRFV
0:02:01.069136
ESIARNTOLCDEUPMGSHIBYAOTRFV
0:01:39.992728
ESIARNTOLCDEUPMGSHIBYAONTRFV
0:01:09.800219
ESIARNTOLCDEUPMGSHIBYAONTRFVK
0:00:46.791619
ESIARNTOLCDEUPMGSHIBYAONTRFVKW
0:00:30.404072
ESIARNTOLCDEUPMGSHIBYAONTRFVLKW
0:00:17.932808
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWC
0:00:16.610114
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWEC
0:00:13.385111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWECZ
0:00:11.731393
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZ
0:00:09.263752
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZD
0:00:08.459759
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZDP
0:00:06.778232
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZDPX
0:00:05.502753
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPX
0:00:05.504382
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXM
0:00:03.345941
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMU
0:00:02.786702
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUG
0:00:02.258980
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGQ
0:00:02.181486
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGQJ
0:00:02.069378
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGOHQ
0:00:01.678280
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGOHQJ
0:00:01.758940
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJ
0:00:01.015172
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJA
0:00:00.782816
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJAB
0:00:00.590803
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJABF
0:00:00.560564
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJABFT
0:00:00.789469
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFT
0:00:00.213274
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTR
0:00:00.142143
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTER
0:00:00.117293
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERL
0:00:00.062241
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLY
0:00:00.044667
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIY
0:00:00.032022
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYK
0:00:00.033368
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKC
0:00:00.028749
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCV
0:00:00.028905
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVW
0:00:00.038960
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVWZ
0:00:00.032474
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZ
0:00:00.031382
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZD
0:00:00.032725
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDG
0:00:00.028246
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGO
0:00:00.026071
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGON
0:00:00.023227
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONM
0:00:00.019831
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMP
0:00:00.015209
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPU
0:00:00.011805
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUE
0:00:00.013564
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEI
0:00:00.011815
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIA
0:00:00.005378
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIAB
0:00:00.013570
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABL
0:00:00.004920
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLH
0:00:00.002972
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHT
0:00:00.002923
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTF
0:00:00.003162
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSF
0:00:00.004121
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFR
0:00:00.004218
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJ
0:00:00.005604
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJC
0:00:00.006316
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCY
0:00:00.012737
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYN
0:00:00.010365
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNO
0:00:00.011111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOI
0:00:00.013791
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIG
0:00:00.016119
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGD
0:00:00.017771
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDX
0:00:00.015684
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXW
0:00:00.014175
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZW
0:00:00.015253
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWU
0:00:00.009111
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEU
0:00:00.009990
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUK
0:00:00.013182
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKQ
0:00:00.003639
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKQV
0:00:00.002877
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQV
0:00:00.001968
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVM
0:00:00.001155
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMA
0:00:00.000890
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAP
0:00:00.000592
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPF
0:00:00.000894
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFB
0:00:00.000512
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTB
0:00:00.000452
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBE
0:00:00.000365
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEI
0:00:00.000277
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZ
0:00:00.000368
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZK
0:00:00.000321
ESIARNTOLCDEUPMGSHIBYAONTRFVLKWESCZIDPXMUGONHQJASBFTERLIYKCVSWZDGONMPUEIABLHTSFRJCYNOIGDXZWEUKSQVMAPFTBEIZKL
0:00:00.000275

And here is the code

import re
import os
import datetime
path = "enable.txt"
words = []
with open(path) as f:
    for values in f:
        words.append(values.strip().upper())

key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# from the next letter farther in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This allows us to use the fact that a second 'e' is
# unlikely, and store that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.words_len = 0
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words_len += 1
        trie.words.append(word)

    def remove (self, pos):
        trie = Trie(self.path)
        trie.words = None
        trie.dict = self.dict.copy()
        trie.words_len = self.words_len
        trie.count_words = self.count_words
        trie.path = self.path
        if pos in trie.dict:
            trie.count_words -= trie.dict[pos].count_words
            trie.dict.pop(pos)
        return trie

    def merge (self, other):
        trie = Trie(self.path)
        trie.words = None
        trie.words_len = self.words_len + other.words_len
        trie.count_words = self.count_words + other.count_words
        trie.path = self.path
        trie.dict = self.dict.copy()
        for pos, subtrie in other.dict.items():
            if pos in trie.dict:
                trie.dict[pos] = trie.dict[pos].merge(subtrie)
            else:
                trie.dict[pos] = subtrie
        return trie

    def __repr__ (self):
        answer = self.path + ":\n  words_len: " + str(self.words_len) + "\n  count_words:" + str(self.count_words) + " \n  dict:\n"
        def indent (string):
            p = re.compile("^", re.M)
            return p.sub("    ", string)
        for pos in sorted(self.dict.keys()):
            subtrie = self.dict[pos]
            answer = answer + indent(str(pos) + ":\n" + subtrie.__repr__())
        return answer

base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

base_tries = [base_trie]
last_trie = base_trie
for i in range(len(KEYS)):
    subtrie = last_trie.dict[i]
    last_trie = last_trie.remove(i).merge(subtrie)
    base_tries.append(last_trie)

#print(base_tries[1].__repr__())

def best_solution (size):
    def solve (subset, pos, best, partial):
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found

        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + include_trie.words_len,
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )

            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    best_count = 0
    best_subset = []
    start_subset = [i for i in range(size)]
    while 0 < len(start_subset):
        trie = base_tries[len(start_subset)].remove(start_subset[-1]+1)
        count, subset = solve(list(start_subset), start_subset[-1]+2, best_count, [(trie.words_len, trie.count_words, trie)])
        if best_count < count:
            best_count = count
            best_subset = subset
        start_subset.pop()
    return ''.join([KEYS[x][0] for x in best_subset])

for i in range(len(KEYS)):
    start = datetime.datetime.now()
    print(best_solution(i))
    print(datetime.datetime.now() - start)
btilly
  • This one also correctly omits a result for just one character :) Amazing work! – גלעד ברקן Oct 12 '20 at 04:17
  • I added [another answer](https://stackoverflow.com/a/64312285/2034787) that does consider duplicates. The greedy branches seem so skewed so that almost all the early candidates seem to be correct and are arrived at very fast. But I hacked the program's way to a solution by terminating early, literally. If you have any inclination, I'd welcome a thought about whether this line of development could be expanded to better prune. (I know, I should probably first study your answer :) – גלעד ברקן Oct 12 '20 at 05:54
  • @גלעדברקן I believe that better pruning could definitely still pay off. I suspect that my approach would be faster if we do a depth-first prune of the trie with some smart caching to avoid having to recalculate prunes we do regularly. The principle being that it is worth surprisingly large amounts of up front work to reduce the exponent in exponential growth. – btilly Oct 12 '20 at 23:07

The goal should be to make your algorithm run in time on the order of the size of the word list. That is, we don't want to waste time generating words that are not in the list; that way you can ensure that your algorithm is efficient.


  1. The first thing you probably should do is to divide the list by word length.

    So take your enable.txt file and divide it into files that contain all 2-letter words, 3-letter words, etc. Maybe call them enable_2.txt, enable_3.txt, etc.

    import collections

    length_map = collections.defaultdict(list)
    with open("enable.txt", "r") as f:
      for word in f:
        word = word.strip()
        length_map[len(word)].append(word)

    for n in length_map:
      with open(f"enable_{n}.txt", "w+") as f:
          f.write("\n".join(length_map[n]))
    
    

    You don't have to do this, but it really helps to speed up the next tasks.

  2. Now if you have n tiles, you want to read all the enable_n.txt files which contain the 2-letter to n-letter words.

    If you didn't do the first part, then just read the enable.txt file you have.

    For each word in the file, you need to check if your tiles contains all the letters in that word.

    This is a python function you can use to check this:

    import collections
    
    def word_in_tiles(word, tiles):
      tiles_counter = collections.Counter(tiles)
      return all(tiles_counter.get(ch, 0) >= cnt for ch, cnt in collections.Counter(word).items())
    

    If the function returns True for any word, you can print that word.

This method would be much faster than generating all possible combinations of tiles.

In the worst case, this method would be as fast as if you had somehow generated only the valid words already in enable.txt and then filtered out the ones that didn't have the correct length.

smac89

Here's a method that returns the results btilly listed in their new answer, but limits the execution time for each result to 20 seconds (except for length 22, and there may be a couple of other outliers) :)

The method uses bitsets and prioritises letter frequency in the word list (each word is hashed to a 201-bit bitset of letters so that duplicate letters are represented), partitioning the relevant word list by keeping only word hashes that do not contain the newly excluded letter.

This is a hack to exit the recursion early, because the greedy branches seem so skewed. I'd welcome ways to leverage the ideas here (studying btilly's and David Eisenstat's answers more carefully notwithstanding).
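To see what the duplicate-aware hash looks like: as_number_with_duplicates (below) shifts the k-th occurrence of a letter into its own 26-bit band, so repeated letters stay distinguishable (a quick check, assuming the helpers below):

n = as_number_with_duplicates("baa")
# 'b' -> bit 1, first 'a' -> bit 0, second 'a' -> bit 26 + 0
assert n == (1 << 1) | (1 << 0) | (1 << 26)
print(get_letters(n))  # 'aba' -- one letter recovered per occupied bit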

from collections import defaultdict
import time

def as_number(word):
    word = word.lower()
    n = 0
    for c in word:
        m = ord(c) - 97
        if n & (1 << m):
            return 0
        else:
            n |= 1 << m
    return n

def as_number_with_duplicates(word):
    counts = defaultdict(lambda: 0)
    word = word.lower()
    n = 0
    for c in word:
        m = ord(c) - 97
        n |= 1 << (counts[c] * 26 + m)
        counts[c] += 1
    return n

def get_letters(n):
    letters = ""
    i = 0
    while n:
        if n & 1:
            letters += chr(97 + i % 26)
        n >>= 1
        i += 1
    return letters

def add_to_freq(freq, bits):
    i = 0
    while bits:
        if bits & 1:
            freq[i] += 1
        bits >>= 1
        i += 1

def arrange_freq(freq):
    return sorted([(x, i) for i, x in enumerate(freq)])

def get_tuple(tpl, i, k, seen, num_to_count_map):
    bits = tpl[2]["bits"]
    freq = [0] * len(tpl[2]["freq"])
    nums = tpl[2]["nums"]

    bit = 1 << i
    new_bits = bits ^ bit

    if new_bits in seen:
        return None

    seen.add(new_bits)

    new_nums = []
    count = 0
    for num in nums:
        if not (num & bit):
            new_nums.append(num)
            count += num_to_count_map[num]
            add_to_freq(freq, num)
    return (count, tpl[1] - 1, {
        "bits": new_bits,
        "freq": arrange_freq(freq),
        "nums": new_nums
    })

def upper_index(a, val, left, right):
    if left == right:
        return left
    mid = (left + right + 1) // 2

    if a[mid][0] > val:
        return upper_index(a, val, left, mid-1)
    else:
        return upper_index(a, val, mid, right)

# tpl = (word_count, letter_count, {bits, freq, nums})
def solve(tpl, k, seen, num_to_count_map, global_best):
    if tpl[1] == k:
        print "cand: %s, %s, %s" % (get_letters(tpl[2]["bits"]), tpl[1], tpl[0])
        if tpl[0] > global_best["count"]:
            global_best["count"] = tpl[0]
            global_best["bits"] = tpl[2]["bits"]

    if tpl[1] > k:
        freq = tpl[2]["freq"]
        start = 1 + upper_index(freq, 0, 0, len(freq))
        for idx in range(start, len(freq)):
            if time.time() > global_best["end"]:
                return

            _freq, i = freq[idx]
            new_tpl = get_tuple(tpl, i, k, seen, num_to_count_map)

            # Prune
            if not new_tpl or new_tpl[0] < global_best["count"]:
                continue

            solve(new_tpl, k, seen, num_to_count_map, global_best)

def f(words, k):
    num_to_count_map = defaultdict(lambda: 0)
    nums = set()
    all_bits = 0
    freq = []

    for w in words:
        if len(w) > k:
            continue
        num = as_number_with_duplicates(w)
        nums.add(num)
        num_bits = len(bin(num)[2:])
        if len(freq) < num_bits:
            freq.extend([0] * (num_bits - len(freq)))
        add_to_freq(freq, num)
        num_to_count_map[num] += 1
        all_bits |= num

    tpl = (len(words), bin(all_bits).count("1"), {"bits": all_bits, "freq": arrange_freq(freq), "nums": nums})

    global_best = {
        "count": -float('inf'),
        "bits": 0,
        "end": time.time() + (5*60 if k == 22 else 20)
    }

    solve(tpl, k, set([all_bits]), num_to_count_map, global_best)

    return global_best

def format(w):
    return "".join(sorted(w.lower()))

def are_equal(w1, w2):
    return format(w1) == format(w2)

"""
import os
import datetime
path = "enable.txt"
words = []
with open(path) as file:
    for values in file:
        words.append(values.strip())

start = datetime.datetime.now()
for i in range(2, 31):
    result = f(words, i)
    formatted = format(get_letters(result["bits"]))
    print(result["count"], formatted)
    print(are_equal(formatted, btilly[i]))
print(datetime.datetime.now() - start)
"""

# Code by btilly

import re
import os
import datetime
path = "enable.txt"
words = []
with open(path) as file:
    for values in file:
        words.append(values.strip().lower())

key_count = {}
for word in words:
    seen = {}
    for letter in word:
        if letter not in seen:
            seen[letter] = 0
        key = (letter, seen[letter])
        if key not in key_count:
            key_count[key] = 1
        else:
            key_count[key] += 1
        seen[letter] += 1


KEYS = sorted(key_count.keys(), key=lambda key: -key_count[key])
#print(KEYS)
#print(len(KEYS))
KEY_POS = {}
for i in range(len(KEYS)):
    KEY_POS[KEYS[i]] = i

# Now we will build a trie.  Every node has a list of words, and a dictionary
# from the next letter farther in the trie.
# BUT TRICK: we will map each word to a sequence of numbers, and those numbers
# will be indexes into KEYS.  This allows us to use the fact that a second 'e' is
# unlikely, and store that efficiently.
class Trie:
    def __init__(self, path):
        self.words = []
        self.dict = {}
        self.words_len = 0
        self.count_words = 0
        self.path = path

    def add_word (self, word):
        trie = self

        poses = []
        seen = {}
        for letter in word:
            if letter not in seen:
                seen[letter] = 0
            key = (letter, seen[letter])
            poses.append(KEY_POS[(key)])
            seen[letter] += 1
        sorted_poses = sorted(poses)
        for i in range(len(sorted_poses)):
            trie.count_words += 1
            pos = sorted_poses[i]
            if pos not in trie.dict:
                trie.dict[pos] = Trie(trie.path + KEYS[pos][0])
            trie = trie.dict[pos]
        trie.count_words += 1
        trie.words_len += 1
        trie.words.append(word)

    def remove (self, pos):
        trie = Trie(self.path)
        trie.words = None
        trie.dict = self.dict.copy()
        trie.words_len = self.words_len
        trie.count_words = self.count_words
        trie.path = self.path
        if pos in trie.dict:
            trie.count_words -= trie.dict[pos].count_words
            trie.dict.pop(pos)
        return trie

    def merge (self, other):
        trie = Trie(self.path)
        trie.words = None
        trie.words_len = self.words_len + other.words_len
        trie.count_words = self.count_words + other.count_words
        trie.path = self.path
        trie.dict = self.dict.copy()
        for pos, subtrie in other.dict.items():
            if pos in trie.dict:
                trie.dict[pos] = trie.dict[pos].merge(subtrie)
            else:
                trie.dict[pos] = subtrie
        return trie

    def __repr__ (self):
        answer = self.path + ":\n  words_len: " + str(self.words_len) + "\n  count_words:" + str(self.count_words) + " \n  dict:\n"
        def indent (string):
            p = re.compile("^", re.M)
            return p.sub("    ", string)
        for pos in sorted(self.dict.keys()):
            subtrie = self.dict[pos]
            answer = answer + indent(str(pos) + ":\n" + subtrie.__repr__())
        return answer

base_trie = Trie('')
for word in words:
    base_trie.add_word(word)

base_tries = [base_trie]
last_trie = base_trie
for i in range(len(KEYS)):
    subtrie = last_trie.dict[i]
    last_trie = last_trie.remove(i).merge(subtrie)
    base_tries.append(last_trie)

#print(base_tries[1].__repr__())

def best_solution (size):
    def solve (subset, pos, best, partial):
        found = sum(x[0] for x in partial)
        upper_bound = sum(x[1] for x in partial)
        if size <= len(subset) or upper_bound < best or len(KEYS) <= pos:
            return (found, subset)
        if best < found:
            best = found

        # Figure out our next calculations.
        partial_include = []
        partial_exclude = []
        finalized_found = 0
        for this_found, this_bound, this_trie in partial:
            if this_trie is None:
                # This is a generic record of already emptied tries
                finalized_found += this_found
            elif pos in this_trie.dict:
                include_trie = this_trie.dict[pos]
                partial_include.append((
                    this_found + include_trie.words_len,
                    include_trie.count_words + this_found,
                    include_trie
                ))
                # We included the tally of found words in the previous partial.
                # So do not double-count by including it again
                partial_include.append((
                    0,
                    this_bound - include_trie.count_words - this_found,
                    this_trie
                ))
                partial_exclude.append((
                    this_found,
                    this_bound - include_trie.count_words,
                    this_trie
                ))
            elif this_found == this_bound:
                finalized_found += this_found
            else:
                partial_include.append((
                    this_found,
                    this_bound,
                    this_trie
                ))

                partial_exclude.append((
                    this_found,
                    this_bound,
                    this_trie
                ))
        if 0 < finalized_found:
            partial_include.append(
                (finalized_found, finalized_found, None)
            )

            partial_exclude.append(
                (finalized_found, finalized_found, None)
            )

        found_include, subset_include = solve(subset + [pos], pos+1, best, partial_include)
        if best < found_include:
            best = found_include
        found_exclude, subset_exclude = solve(subset, pos+1, best, partial_exclude)
        if found_include < found_exclude:
            return (found_exclude, subset_exclude)
        else:
            return (found_include, subset_include)


    best_count = 0
    best_subset = []
    start_subset = [i for i in range(size)]
    while 0 < len(start_subset):
        trie = base_tries[len(start_subset)].remove(start_subset[-1]+1)
        count, subset = solve(list(start_subset), start_subset[-1]+2, best_count, [(trie.words_len, trie.count_words, trie)])
        if best_count < count:
            best_count = count
            best_subset = subset
        start_subset.pop()
    return ''.join([KEYS[x][0] for x in best_subset])

# End code by btilly

for i in range(25, len(KEYS)):
    start = datetime.datetime.now()
    btilly = best_solution(i)
    print(btilly)
    print(datetime.datetime.now() - start)
    start = datetime.datetime.now()
    result = f(words, i)
    formatted = format(get_letters(result["bits"]))
    print(formatted)
    print(datetime.datetime.now() - start)
    print(are_equal(btilly, formatted))
    print("")
גלעד ברקן

If you do selection with replacement, you always get the same number of arrangements for a given hand size, but a hand of distinct characters yields more unique arrangements:

  • AA: AA,AA,A,A
  • AB: AB,BA,A,B

As this will get very large, very quickly, rather than statically creating all of the combinations, you could create a generator which will yield all of your values as-needed (as the input to some other calculation).

import string
import itertools

def gen_substrs_with_replacement(maxlen, max_chars=26):
    letters = string.ascii_lowercase[:max_chars]
    for length in range(1, maxlen + 1):  # start at 1 and end at N
        yield from itertools.combinations_with_replacement(letters, length)

However, if you just want the count of values, you can calculate them with the binomial formula with replacement.

The number of ways to choose k items from n with replacement is C(n + k - 1, k).

Note that if you want the shorter selections too, you'll actually want the sum of the series with one fewer selection each time (ABCD: AAA, AA, A).

from math import comb as nchoosek

# N choose K with replacement, optionally summed over every shorter selection
def nCk_r(source_length, selections, sublist=False):
    if sublist and selections > 1:
        return nchoosek(source_length + selections - 1, selections) + nCk_r(source_length, selections - 1, True)
    return nchoosek(source_length + selections - 1, selections)

>>> list(gen_substrs_with_replacement(2, 3))
[('a',), ('b',), ('c',), ('a', 'a'), ('a', 'b'), ('a', 'c'), ('b', 'b'), ('b', 'c'), ('c', 'c')]
>>> len(list(gen_substrs_with_replacement(2, 3)))
9
>>> nCk_r(2,3,True)
9
>>> len(list(gen_substrs_with_replacement(12, 10)))
646645
>>> nCk_r(12,10,True)
646645
ti7
  • I think I miscommunicated my problem. It seems like you're just computing the total number of possible unique combinations for a given subset of letters. What I was asking was an algorithm which returns which set of letters could be used to spell the most english-valid words without replacement for a given N. So I just specify the number of letters (e.g. 3), and the algorithm returns which set of letters allows you to construct the most english-valid words from that set. – Parseltongue Oct 05 '20 at 19:59
  • @Parseltongue hmm.. must the words be valid per some dictionary or simply collections of letters which can be created with some Latin charset? – ti7 Oct 05 '20 at 20:06
  • They have to be valid as per some dictionary. I'm using the enable.txt wordlist: https://www.wordgamedictionary.com/enable/ – Parseltongue Oct 05 '20 at 20:09
    Ahah - in which case, I think that as at minimum you'll need to check if a word is valid, parsing each word in the wordlist (perhaps to a `set`), storing it in some collection, and basing your logic from that makes more sense than trying to brute force the characters! This certainly doesn't make your question invalid, but quite different to what I had thought! – ti7 Oct 05 '20 at 20:17