I have a file that contains just over 100,000 words. What I have to do is run through every 5 letter combination of the alphabet to work out the 5 letters that are used by the least number of words.
I've worked out a python based program that will eventually get the answer, but at the speed it is going it would probably take around 48 hours if not longer. Part the problem is the sheer number of calculations. I haven't worked out how to limit permutations so that only distinct strings are compared - so that is 265 calculations just on the combinations alone, and then each of these is compared against 100,000 words, which works out at around 10*1011 calculations at a bare minimum.
Is there any way to speed this process up substantially, be that through more efficient algorithms, or multi-threading or something like that?
Any recommendations for books/articles on algorithm efficiency would also be greatly appreciated.
My current program is as follows:
Imports permutation function from itertools module:
from itertools import permutations
Asks if the word contains the forbidden letters:
def avoids (word, letters):
for characters in letters:
for character in word:
if character == characters:
return False
return True
Calculates the number of words from the file that do not contain the forbidden characters:
def number_of_words(file, letters):
open_file = open(file)
x = 0 #counter
for line in open_file:
word = line.strip()
if avoids(word, letters) == True:
x += 1
return x
Runs through every variation of five letters that exists in the alphabet and calculates the combination that excludes the fewest words:
def find_smallest():
y = 0
#every combination of letters
for letters in permutations("abcdefghijklmnopqrstuvwxyz", 5):
x = number_of_words("words.txt", letters)
#sets y to the initial value of x
if y == 0:
y = x
print "Start point, combination: %s, amount: %d" % (letters, y)
#If current combination is greater than all previous combinations set y to x
elif x > y:
y = x
combination = letters
duplication = 0
print "New highest, combination: %s, amount: %d" % (letters, y)
print "%s excludes the smallest number of words (%d)" % (combination, y)
Runs the program:
find_smallest()