1

I'm trying to get all the words made from the letters, 'crbtfopkgevyqdzsh' from a file called web2.txt. The posted cell below follows a block of code which improperly returned the whole run up to a full word e.g. for the word shocked it would return s, sh, sho, shoc, shock, shocke, shocked

So I tried a trie (know pun intended).

web2.txt is 2.5 MB in size, and contains 2,493,838 words of varying length. The trie in the cell below is breaking my Google Colab notebook. I even upgraded to Google Colab Pro, and then to Google Colab Pro+ to try and accommodate the block of code, but it's still too much. Any more efficient ideas besides trie to get the same result?

# Find the words3 word list here:  svnweb.freebsd.org/base/head/share/dict/web2?view=co

trie = {}

with open('/content/web2.txt') as words3:


    for word in words3:
        cur = trie
        for l in word:
            cur  = cur.setdefault(l, {})
            cur['word'] = True # defined if this node indicates a complete word
        
def findWords(word, trie = trie, cur = '', words3 = []):
    for i, letter in enumerate(word):
        if letter in trie:
            if 'word' in trie[letter]:
                words3.append(cur)
            findWords(word, trie[letter], cur+letter, words3 )    
            # first example: findWords(word[:i] + word[i+1:], trie[letter], cur+letter, word_list )

    return [word for word in words3 if word in words3]

words3 = findWords("crbtfopkgevyqdzsh")

I'm using Pyhton3

peoplet
  • 81
  • 1
  • 7
  • 1
    `[word for word in words3 if word in words3]`: what's that meant to do? – Peter Wood Jun 22 '22 at 23:25
  • 1
    Are you strictly required to use a trie? Also, the expected output is a list of all the words than can be made from the letters in the input string? – BrokenBenchmark Jun 22 '22 at 23:29
  • That's exactly right @BrokenBenchmark, I'm open to any efficient and proper solution that returns a list of all the words that can be made from the letters in the input string – peoplet Jun 22 '22 at 23:35
  • 1
    Also note: You're using a mutable default for `findWords`'s `trie` and `words3` arguments. This will break if you call `findWords` more than once without passing `trie` & `words3` explicitly (it works in the code shown because the recursive calls all pass them explicitly, and they're supposed to be shared in that case anyway). See: ["Least Astonishment" and the Mutable Default Argument](https://stackoverflow.com/q/1132941/364696) – ShadowRanger Jun 22 '22 at 23:37
  • I understand @ShadowRanger. Thanks for the heads up. I'll tackle that after. – peoplet Jun 22 '22 at 23:39
  • 1
    As a suggestion, [`pyahocorasick`](https://pyahocorasick.readthedocs.io/en/latest/index.html) provides a C-optimized trie as part of implementing the efficient multi-string search of the Aho-Corasick string search algorithm; you might save a decent amount of memory by using it instead of implementing it from scratch with Python level primitives. – ShadowRanger Jun 22 '22 at 23:47

1 Answers1

1

A trie is overkill. There's about 200 thousand words, so you can just make one pass through all of them to see if you can form the word using the letters in the base string.

This is a good use case for collections.Counter, which gives us a clean way to get the frequencies (i.e. "counters") of the letters of an arbitrary string:

from collections import Counter

base_counter = Counter("crbtfopkgevyqdzsh")
with open("data.txt") as input_file:
    for line in input_file:
        line = line.rstrip()
        line_counter = Counter(line.lower())
        # Can use <= instead if on Python 3.10
        if line_counter & base_counter == line_counter:
            print(line)
BrokenBenchmark
  • 18,126
  • 7
  • 21
  • 33
  • Do you only want to output Becky in this case? – BrokenBenchmark Jun 22 '22 at 23:51
  • 1
    Wait, I'm looking again, and I think your answer is right! The list seems complete. I got thrown off by the letter b. I can just right a line to get rid of any single letters that aren't A and I to be used alone in a sentence. – peoplet Jun 22 '22 at 23:54