
I have a file that contains just over 100,000 words. What I have to do is run through every 5-letter combination of the alphabet to work out the 5 letters that are used by the least number of words.

I've worked out a Python-based program that will eventually get the answer, but at the speed it is going it would probably take around 48 hours, if not longer. Part of the problem is the sheer number of calculations. I haven't worked out how to limit the permutations so that only distinct strings are compared, so that is 26^5 calculations on the combinations alone, and then each of these is compared against 100,000 words, which works out at around 10 * 10^11 calculations at a bare minimum.

Is there any way to speed this process up substantially, be that through more efficient algorithms, or multi-threading or something like that?

Any recommendations for books/articles on algorithm efficiency would also be greatly appreciated.


My current program is as follows:

Imports permutation function from itertools module:

from itertools import permutations

Asks if the word contains the forbidden letters:

def avoids(word, letters):
    for characters in letters:
        for character in word:
            if character == characters:
                return False
    return True

Calculates the number of words from the file that do not contain the forbidden characters:

def number_of_words(file, letters):

    open_file = open(file)

    x = 0  # counter
    for line in open_file:
        word = line.strip()
        if avoids(word, letters):
            x += 1
    open_file.close()
    return x

Runs through every variation of five letters that exists in the alphabet and calculates the combination that excludes the fewest words:

def find_smallest():

    y = 0
    combination = None

    #every combination of letters
    for letters in permutations("abcdefghijklmnopqrstuvwxyz", 5):
        x = number_of_words("words.txt", letters)
        #sets y to the initial value of x
        if y == 0:
            y = x
            combination = letters
            print "Start point, combination: %s, amount: %d" % (letters, y)

        #If current combination leaves more words than all previous combinations, keep it
        elif x > y:
            y = x
            combination = letters
            print "New highest, combination: %s, amount: %d" % (letters, y)

    print "%s excludes the smallest number of words (%d)" % (combination, y)

Runs the program:

find_smallest()
advert2013
    Are you looking for Permutations or Combinations? Also, do you reload your 100000-words-file in each iteration of the permutations/combinations loop? – tobias_k Feb 19 '14 at 15:56
  • This question appears to be off-topic because it belongs on http://codereview.stackexchange.com – jonrsharpe Feb 19 '14 at 15:59
    Are you looking for 5 least used letters or the least used 5-letter sequence? – bereal Feb 19 '14 at 16:05
    By "the 5 letters that are used by the least number of words" you could mean at least two things: the five most infrequently used letters ever (imagine there was just one really long word), or the five-set of letters used the least frequently in *any* word. Which are you thinking of? [The latter probably isn't unique, for example.] – DSM Feb 19 '14 at 16:05
  • sorry, yes least used letters in any word, not sequence. I was not aware of code review, but I will post there next time. – advert2013 Feb 19 '14 at 16:11

2 Answers

  1. You can use combinations instead of permutations.
  2. Why not just scan all the words once, count the number of appearances of each letter, then select the 5 with the minimum number of appearances?
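
Both points can be sketched together. This is a hedged illustration, not the asker's program: the word list below is a tiny stand-in for the real words.txt, and `math.perm`/`math.comb` (Python 3.8+) just show how much the search space shrinks.

```python
import math
from collections import Counter

# point 1: order doesn't matter, so combinations beat permutations
print(math.perm(26, 5))  # 7893600 ordered 5-letter selections
print(math.comb(26, 5))  # 65780 unordered 5-letter selections

# point 2: one pass over the words is enough
words = ["hello", "world", "quiz"]  # stand-in for the 100,000-word file
letter_counts = Counter()
for w in words:
    letter_counts.update(set(w.lower()))  # set() so a word counts once per letter
# the 5 least-used letters are the 5 smallest counts (unseen letters count as 0)
```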
yurib
  • Might need OP to clarify if multiple letters in one word should not count towards its frequency total. i.e. with the set `{'there','hat'}` frequency of 'e' is only 1? – C.B. Feb 19 '14 at 16:06
  • up to him to count in whichever way suits his needs, the same approach still stands whether he counts each occurrence or all occurrences in a word as 1 – yurib Feb 19 '14 at 16:09
  • I will read up on combinations which seems to be exactly what I was after - even though you are totally right with 2., so I don't actually need either. – advert2013 Feb 19 '14 at 16:14
  • 1
    @C.B. Easy fix if that's what's wanted - converting the word to a `Set` and then to a `list` will uniq-ify the set of letters. – pjs Feb 19 '14 at 16:31

This is not a question about increasing permutation efficiency. It is really a question about how to make a smarter algorithm; it is a data structure question.

I have a file that contains just over 100,000 words. What I have to do is run through every 5 letter combination of the alphabet to work out the 5 letters that are used by the least number of words.

Loop through the 26 letters of the alphabet and count the number of words in your list which use each letter:

import string
alphabet = string.ascii_lowercase
# 'words' is the list of words loaded from your file
counts = {k: sum(k in word.lower() for word in words) for k in alphabet}

This should be pretty fast, and should give you enough information to trivially pick out the five least popular letters.
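For instance, the selection step could look like this (the `counts` dict here is a small made-up sample, not the real data):

```python
# made-up sample counts; in practice use the dict built from the word list
counts = {"a": 50, "b": 3, "e": 99, "j": 4, "q": 1, "x": 2, "z": 0}
least_five = sorted(counts, key=counts.get)[:5]  # keys sorted by ascending count
print(least_five)  # ['z', 'q', 'x', 'b', 'j']
```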

An equivalent approach, probably more efficient but maybe less clear than the one above:

import string
from collections import Counter
counter = Counter({k: 0 for k in string.ascii_lowercase})
counter.update(c for w in words for c in set(w.lower()))
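
To then read off the five rarest letters from the counter, `most_common()` can be sliced from the end. A sketch with a stand-in word list (tie order among equal counts is arbitrary):

```python
import string
from collections import Counter

words = ["jazz", "quiz", "fox"]  # stand-in word list
counter = Counter({k: 0 for k in string.ascii_lowercase})
counter.update(c for w in words for c in set(w.lower()))
# most_common() sorts descending, so the reversed tail slice is the 5 rarest
rarest = counter.most_common()[:-6:-1]
print(rarest)
```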
wim
  • Yes this is what I was after, smarter algorithm is exactly right. Just tried it out, much much faster. – advert2013 Feb 19 '14 at 16:12
  • I'd get rid of the chaining, as it first creates a huge tuple with a set for every word, and because it's not really readable and replace it with a plain generator statement: `Counter(c for w in words for c in set(w.lower()))` – Michael Feb 19 '14 at 16:25
  • Yeah the nested comprehension is much better, itertools was sucky and unpythonic as usual. You can edit it inplace please. – wim Feb 19 '14 at 16:27