I think an important part of this problem is how you interpret the clause "highest similarity with other words in dictionary" in the problem description. If you take that to mean the highest sum of similarities with all the other words, you can compute that sum in two linear passes over the word list, rather than with the quadratic number of comparisons your nested loops make.
Rather than directly comparing all words, you can instead count how many different words contain each letter. Then, in a second pass, for each word you add up the counts for each of its letters. That sum is almost the sum of similarities we want, except that it includes the word's similarity with itself (which is the number of unique letters it contains), so we subtract that out, since we're only supposed to compare to other words.
Here's a function that computes a list of similarities in the same order the words are in:
from collections import Counter

def compute_similarities(words):
    letter_count = Counter()
    for word in words:                  # first pass:
        letter_count.update(set(word))  # count the words containing each letter
    similarities = []
    for word in words:                  # second pass:
        letters = set(word)
        similarities.append(sum(letter_count[letter] for letter in letters)
                            - len(letters))  # correct for self-similarity
    return similarities
Here's an example run, with the names of the months as our dictionary:
>>> months = [
...     "january", "february", "march", "april",
...     "may", "june", "july", "august",
...     "september", "october", "november", "december",
... ]
>>> compute_similarities(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]
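If you'd rather pick the winner out programmatically than by eye, pairing each similarity with its word and taking the max works (just a small convenience, not part of the algorithm itself):

>>> max(zip(compute_similarities(months), months))
(28, 'february')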
So, it looks like February is the most similar to the other months. To validate that this counts correctly (given my assumption about what "similarity with other words" means for each word), we can compare its results against another version built from your similarity function and nested loops, which works just fine for short word lists. (One subtlety: the i != j check below compares words by value, so the two versions only agree when the dictionary contains no duplicate words.)
def similar(a, b):
    x = list(set(a) & set(b))
    return len(x)

def compute_similarities_bruteforce(words):
    similarities = []
    for i in words:
        total = 0
        for j in words:
            if i != j:
                total += similar(i, j)
        similarities.append(total)
    return similarities
Test run:
>>> compute_similarities_bruteforce(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]
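Or compare the two implementations directly:

>>> compute_similarities_bruteforce(months) == compute_similarities(months)
True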
After downloading the word list you're supposed to use (and very lightly pre-processing it, e.g. making everything lowercase), I found these to be the most similar words in the list:
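The loading step might look something like this minimal sketch; "wordlist.txt" is just a stand-in name for wherever you saved the downloaded list:

def load_words(path="wordlist.txt"):  # hypothetical filename, not from the original problem
    with open(path) as f:
        # lowercase each line and skip blank lines
        return [line.strip().lower() for line in f if line.strip()]

words = load_words()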
>>> similarities = compute_similarities(words)
>>> similarity_sorted_words = sorted(zip(similarities, words), reverse=True)
>>> similarity_sorted_words[:20]
[(3059877, 'pseudolamellibranchiate'),
(3059877, 'pseudolamellibranchiata'),
(2973121, 'pseudoconglomeration'),
(2966493, 'pseudoanthropological'),
(2956212, 'pseudoromantically'),
(2949584, 'spondylotherapeutics'),
(2949584, 'pseudophilanthropically'),
(2949584, 'pseudohallucinatory'),
(2946269, 'neuropharmacologist'),
(2932039, 'salpingo-oophorectomy'),
(2929360, 'hyperconstitutionalism'),
(2925092, 'encephalomyocarditis'),
(2887146, 'sphygmomanometrically'),
(2887146, 'semianthropologically'),
(2887146, 'oophorosalpingectomy'),
(2884111, 'pseudoconservatively'),
(2880336, 'pneumatico-hydraulic'),
(2875526, 'quasi-complimentary'),
(2874192, 'cloth-measuring'),
(2873602, 'cloth-spreading')]
The ties are generally sets of words that contain exactly the same letters (which aren't always obvious, since there are often quite a lot of duplicated letters). You might get slightly different results if you preprocess the word list differently (e.g. keeping case sensitivity, or filtering out non-letter characters like hyphens and apostrophes).
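You can verify a tie directly; the top two words, for instance, differ only in their final letter but contain exactly the same set of letters:

>>> set('pseudolamellibranchiate') == set('pseudolamellibranchiata')
True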