
So I have a ginormous list of 466,550 words, and I need to compare each word with every other word to check the similarity between them. In my case, similarity between two words is defined as the number of common characters between them. But if I try:

sim = []   # max similarity for each word
temp = []
for i in words:
    for j in words:
        if i != j:
            temp.append(similar(i, j))
    sim.append(max(temp))
    temp = []

This runs the comparison 466,550^2 times and takes an eternity to produce the final output. So is there another way to do the same thing in linear time?

Edit: Similar has a pretty basic definition as follows:

def similar(a, b):
    x = list(set(a) & set(b))  # distinct characters common to both words
    return len(x)

I looked up a way to compare words, and it just returns the number of common characters for each comparison.
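
For example, "january" and "february" share the four distinct characters a, r, u, and y:

>>> similar("january", "february")
4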

Edit 2:

So here's the whole question for those interested:

John and Billy are two good friends. John wanted to hack the password of Billy's Insta account just for fun. The password is an English word from this dictionary:

https://raw.githubusercontent.com/dwyl/english-words/master/words.txt.

He cannot try out all the words from the dictionary, as each try will take at least 30 seconds. But he knows the following clues about the password.

  1. Clue 1: The word has the highest similarity with other words in the dictionary. Similarity between two words is defined by the number of common characters.
  2. Clue 2: The word has a large number of vowels in it.
  3. Clue 3: The word has a large number of characters in it.

Create 3 independent word lists ranking the words based on each clue. Finally, rank each word based on its position (weightage of 0.33 for each clue) in these 3 lists. Come up with the top 100 potential words based on the final rank.
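
A rough sketch of that final ranking step, assuming the three per-clue score lists have already been computed (all of the function and parameter names below are made up for illustration):

def top_candidates(words, similarity, vowel_count, length, k=100):
    # Position of each word in the ranking induced by one score list
    # (a higher score means a better, i.e. smaller, position).
    def positions(scores):
        order = sorted(range(len(words)), key=lambda i: -scores[i])
        pos = [0] * len(words)
        for rank, i in enumerate(order):
            pos[i] = rank
        return pos

    p1 = positions(similarity)   # ranking for Clue 1
    p2 = positions(vowel_count)  # ranking for Clue 2
    p3 = positions(length)       # ranking for Clue 3
    # Equal weightage of 0.33 per clue; a smaller combined value ranks higher.
    combined = [0.33 * (p1[i] + p2[i] + p3[i]) for i in range(len(words))]
    best = sorted(range(len(words)), key=lambda i: combined[i])[:k]
    return [words[i] for i in best]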
  • It depends, what's the code for the `similar()` function? – kcsquared Feb 22 '22 at 05:46
  • you need to trim your list down first ... that's just too many – Joran Beasley Feb 22 '22 at 05:49
  • You don't have to compare `word_a` to `word_b` if you have already checked `word_b` against `word_a`. This optimisation will reduce it from `O(n^2)` to `O(n*log n)`. There is not much you can do regarding the time complexity. You can certainly make it a bit faster by precomputing all `set`s in advance and keeping track of `max` instead of appending all values to a list then finding the max. – Selcuk Feb 22 '22 at 05:55
  • @JoranBeasley I can't ;_; this is an assignment for HPC and we're asked to show a serial implementation of this first before doing parallel computing – Tanishq Kohli Feb 22 '22 at 05:55
  • @Selcuk the main goal of this assignment is to do this using parallel computing so that will reduce the time by a ton. But we're asked to show the serial implementation first. I'll try the first approach to see if it makes a dent – Tanishq Kohli Feb 22 '22 at 06:01
  • " I need to compare the words with each other to check the similarity between them" Okay, so what does "with each other" mean? Do you actually need to compare every word to every other word? How many output values do you want? What *overall task do you want to accomplish* by doing these comparisons? – Karl Knechtel Feb 22 '22 at 06:07
  • @KarlKnechtel it's basically an HPC assignment. Here's the whole thing if you're interested: We need to crack a pw and we have 3 total clues: 1. The word has highest similarity with other words in dictionary. Similarity between two words is defined by number of common characters. 2. The word has large number of vowels in it. 3. The word has large number of characters in it. The task is to create 3 independent word lists ranking words based on each clue. Then rank each word based on their position (0.33 per clue) in these 3 lists. Then make a list of the top 100 words – Tanishq Kohli Feb 22 '22 at 06:12
  • Are your words composed of lowercase letters only (or some small alphabet)? If you convert each word to a bit-string of whether each possible character is present or not, the most similar word is the nearest neighbor according to the dissimilarity metric (size of alphabet minus similarity). While this is not the same as Hamming distance, the recommendations in [this post](https://stackoverflow.com/q/6389841/16757174) are probably useful here too. – kcsquared Feb 22 '22 at 06:25 (a sketch of this bit-mask idea appears after the comments)
  • Tanishq, you should edit your question to include the content of your last comment. Your only hope of getting help here is to provide as much information as possible. Also please tell us what is in your dictionary, whether the password is in the dictionary, how long the password might be, and so on. I hope I'm not helping you crack Aunt Sue's password to her bank account. :-) – Cary Swoveland Feb 22 '22 at 06:55
  • @CarySwoveland Just edited the question and added all the info available. The link to the dictionary is also included. And don't worry, Aunt Sue is safe. I don't have the necessary skills to get into her account even if I tried ;_; – Tanishq Kohli Feb 22 '22 at 07:13
  • @kcsquared I just edited the question and provided all the info available to me – Tanishq Kohli Feb 22 '22 at 07:14
  • @Selcuk I do not follow how your suggestion gets down to `O(n log(n))`; it seems that the number of comparisons is cut in half, but is still `O(n^2)`. – hilberts_drinking_problem Feb 22 '22 at 07:29
  • How are you supposed to interpret "highest similarity with other words in dictionary"? Is that the sum of all similarities? The median similarity? I'd assume not the maximum similarity (which it looks like your looping code is using) or you'd always have a tie between the two most-similar words. – Blckknght Feb 22 '22 at 07:31
  • @Blckknght I'm assuming it's the maximum one because even if the max is shared among a lot of words, we don't need to find just 1 word but the top 100. So let's say there are 10,000 words which have the same similarity that is also the maximum. We'll then find out the ones with the maximum number of vowels in them using another function. And at last, we can sort the list by the number of characters in descending order and append the top 100 to the final list. If you have another way of interpreting this question then please share it with me. – Tanishq Kohli Feb 22 '22 at 07:49
  • If you were using the sum of similarities (that is, adding up `similar(i, j)` for all `j` values to get `i`'s score), you can compute it another way that turns out to be more efficient (it's O(N), rather than O(N**2)). I'm working on an answer that shows it. I don't think there's an equivalent optimization for finding the pairs of words with the maximum similarity. – Blckknght Feb 22 '22 at 07:55
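
Putting together two of the suggestions from the comments above (precompute each word's character data once, compare each pair only once, and keep a running max), a serial version might look like the sketch below. It is still O(n^2) pairwise comparisons, as noted above, but each comparison becomes a single AND plus a popcount. The 26-bit masks assume the words contain only lowercase a-z; other characters are ignored.

def max_similarities(words):
    # One 26-bit mask per word: bit k is set if chr(ord('a') + k) occurs in the word.
    masks = [
        sum(1 << (ord(c) - ord('a')) for c in set(w) if 'a' <= c <= 'z')
        for w in words
    ]
    best = [0] * len(words)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):  # each pair is compared only once
            s = bin(masks[i] & masks[j]).count('1')  # number of shared letters
            if s > best[i]:
                best[i] = s
            if s > best[j]:
                best[j] = s
    return best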

1 Answer


I think an important part of this problem is how you interpret the clause "highest similarity with other words in dictionary" in the problem description. If you take that to mean the highest sum of similarities with all the other words, you can compute that sum in a more efficient way than your nested loops.

Rather than directly comparing all words, you can instead count how many different words contain each letter. Then in a second pass, you can add up the number of other words that share each of a word's letters, which is almost the sum of similarities we want. We just need to subtract out the word's similarity with itself (which is the number of unique letters it contains), since we're only supposed to compare to other words.

Here's a function that computes a list of similarities in the same order the words are in:

from collections import Counter

def compute_similarities(words):
    letter_count = Counter()
    for word in words:                  # first pass
        letter_count.update(set(word))  # count each letter

    similarities = []
    for word in words:                  # second pass
        letters = set(word)
        similarities.append(sum(letter_count[letter] for letter in letters)
                            - len(letters))  # correct for self similarity
    return similarities

Here's an example run, with the names of the months as our dictionary:

>>> months = [
    "january", "february", "march", "april",
    "may", "june", "july", "august",
    "september", "october", "november", "december",
]

>>> compute_similarities(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]

So it looks like "february" is the most similar to the other months. To validate that the counting is correct (given my assumption about what "similarity with other words" means), we can compare its results with another version based on your similarity function and nested loops (which works just fine for short word lists).

def similar(a, b):
    x = list(set(a) & set(b))  # distinct characters common to both words
    return len(x)


def compute_similarities_bruteforce(words):
    similarities = []
    for i in words:
        total = 0
        for j in words:
            if i != j:
                total += similar(i, j)
        similarities.append(total)
    return similarities

Test run:

>>> compute_similarities_bruteforce(months)
[23, 28, 18, 14, 12, 13, 10, 12, 24, 21, 23, 22]

After downloading the word list you're supposed to use (and very lightly pre-processing it, e.g. making everything lowercase), I found these to be the most similar words in the list:

>>> similarities = compute_similarities(words)

>>> similarity_sorted_words = sorted(zip(similarities, words), reverse=True)

>>> similarity_sorted_words[:20]
[(3059877, 'pseudolamellibranchiate'),
 (3059877, 'pseudolamellibranchiata'),
 (2973121, 'pseudoconglomeration'),
 (2966493, 'pseudoanthropological'),
 (2956212, 'pseudoromantically'),
 (2949584, 'spondylotherapeutics'),
 (2949584, 'pseudophilanthropically'),
 (2949584, 'pseudohallucinatory'),
 (2946269, 'neuropharmacologist'),
 (2932039, 'salpingo-oophorectomy'),
 (2929360, 'hyperconstitutionalism'),
 (2925092, 'encephalomyocarditis'),
 (2887146, 'sphygmomanometrically'),
 (2887146, 'semianthropologically'),
 (2887146, 'oophorosalpingectomy'),
 (2884111, 'pseudoconservatively'),
 (2880336, 'pneumatico-hydraulic'),
 (2875526, 'quasi-complimentary'),
 (2874192, 'cloth-measuring'),
 (2873602, 'cloth-spreading')]

The ties are generally sets of words that contain exactly the same letters (which aren't always obvious, since there are often quite a lot of duplicated letters). You might get slightly different results if you preprocess the word list differently (e.g. keeping case sensitivity, or filtering out non-letter characters like hyphens and apostrophes).
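
For reference, the "light pre-processing" mentioned above might be as simple as the following sketch (the exact steps used for the results are not shown in the answer, so treat this as an assumption):

import urllib.request

def load_words(url="https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"):
    # Download the dictionary, lowercase each word, and drop empty lines.
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return [line.strip().lower() for line in text.splitlines() if line.strip()]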

Blckknght