2

I am trying to find 9-letter words that, when you split them evenly into 3 parts and jumble the parts around, give you another 9-letter word.

for i in nineWordList:
    for j in nineWordList:
        if (i[3:5] + i[0:2] + i[6:8]) == j:
            correctWords.append(i)
        elif (i[3:5] + i[6:8] + i[0:2]) == j:
            correctWords.append(i)
        elif (i[0:2] + i[6:8] + i[3:5]) == j:
            correctWords.append(i)
        elif (i[6:8] + i[0:2] + i[3:5]) == j:
            correctWords.append(i)
        elif (i[6:8] + i[3:5] + i[0:2]) == j:
            correctWords.append(i)

This is how I do it. The only problem is that `nineWordList` is 68,000 elements long, and this takes ages. How can I improve this to make it more efficient?

Melkor
  • If `correctWords` becomes a set, you can replace the `for j` loop by `if something in correctWords`, and it will be fast enough, O(log #entries). – Gassa Jan 27 '17 at 07:58

2 Answers

8

Use a set to avoid having to loop on two levels through the list:

nineWordSet = set(nineWordList)
for i in nineWordSet:
    if i[3:5] + i[0:2] + i[6:8] in nineWordSet:
        correctWords.append(i)
    elif i[3:5] + i[6:8] + i[0:2] in nineWordSet:
        correctWords.append(i)
    elif i[0:2] + i[6:8] + i[3:5] in nineWordSet:
        correctWords.append(i)
    elif i[6:8] + i[0:2] + i[3:5] in nineWordSet:
        correctWords.append(i)
    elif i[6:8] + i[3:5] + i[0:2] in nineWordSet:
        correctWords.append(i)

This will still have to loop through all those 68,000 entries (you obviously cannot avoid that) but in a first pass, it will convert the list into a set, so membership tests using `in` can be done in constant time. This gives you linear time complexity instead of the quadratic time complexity that your nested loops had. Of course, the additional set will require more memory but that shouldn’t be a problem.
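
If you want to convince yourself of the difference, here is a minimal timing sketch comparing membership tests against a list and against a set. The placeholder strings are hypothetical stand-ins for the real word list, not actual 9-letter words:

import timeit

# Placeholder data -- 68,000 distinct 9-character strings, for illustration only.
words = ['w{:08d}'.format(n) for n in range(68000)]
word_set = set(words)

target = words[-1]  # worst case for the list: the last element

print(timeit.timeit(lambda: target in words, number=100))     # linear scan, O(n) per test
print(timeit.timeit(lambda: target in word_set, number=100))  # hash look-up, O(1) on average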


Btw. I believe your slicing is off. `i[0:2]` will not produce a 3-letter word (when you want to split a 9-letter word evenly):

>>> x = 'abcdefghi'
>>> x[0:2], x[3:5], x[6:8]
('ab', 'de', 'gh')

The second index in slices is always non-inclusive so you need to increase that by one:

>>> x[0:3], x[3:6], x[6:9]
('abc', 'def', 'ghi')

You can also shorten your conditions a bit by using `itertools.permutations` to generate those possible jumbled words. That way, your checks might be a bit nicer to the eye:

import itertools
nineWordSet = set(nineWordList)

for word in nineWordSet:
    for parts in itertools.permutations((word[0:3], word[3:6], word[6:9])):
        # permutations() yields tuples of chunks, so join them back into a word
        perm = ''.join(parts)

        # skip the original permutation
        if perm == word:
            continue

        if perm in nineWordSet:
            correctWords.append(word)

            # stop checking for more permutations
            break
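
If you ever need the same check for words of other lengths, the idea generalizes. The following is a rough sketch under that assumption; the `jumbled_matches` helper is made up for illustration and is not part of the answer. You would call it as `jumbled_matches(nineWordList)`:

import itertools

def jumbled_matches(words, parts=3):
    """Find words whose equal-size chunks, rearranged, spell another word.

    Assumes every word's length is divisible by `parts`.
    """
    word_set = set(words)
    matches = []
    for word in word_set:
        size = len(word) // parts
        chunks = [word[i:i + size] for i in range(0, len(word), size)]
        for perm in itertools.permutations(chunks):
            candidate = ''.join(perm)
            # skip the word itself; count only genuine rearrangements
            if candidate != word and candidate in word_set:
                matches.append(word)
                break
    return matches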
poke
  • Thanks! Worked great! Btw though you wrote 'w' instead of 'word' in the for loop – Melkor Jan 27 '17 at 15:39
  • Note that the two algorithms possibly produce different results. The second collects all permutations that occur in the original list, except for the original word (which is debatable). Whereas the first algorithm (if...elif...elif...) only records the first match - all further permutations are mistakenly skipped. – user1016274 Jan 27 '17 at 16:34
  • @user1016274 No, what is being appended to the `correctWords` in my second code example is the original `word`, not the permutation `perm`. And once *any* permutation match has been found, the remaining permutations of that word are no longer being looked at (that’s what the `break` does, essentially moving to the next `word`). – poke Jan 27 '17 at 17:52
3

Put all your valid words in a Python set and then loop through the set rearranging the words in the manner you've described. For each rearrangement, check to see if it is in the set.

Since Python's set is based on a hash table, the look-ups occur in O(1) (constant) time on average. For a constant number of rearrangements per word, your algorithm then works in O(n) time, which is far better than the O(n^2) algorithm you have now.

The revised code looks like this:

nineWordSet = set(nineWordList)
for i in nineWordSet:
    if i[3:5] + i[0:2] + i[6:8] in nineWordSet:
        correctWords.append(i)
    elif i[3:5] + i[6:8] + i[0:2] in nineWordSet:
        correctWords.append(i)
    elif i[0:2] + i[6:8] + i[3:5] in nineWordSet:
        correctWords.append(i)
    elif i[6:8] + i[0:2] + i[3:5] in nineWordSet:
        correctWords.append(i)
    elif i[6:8] + i[3:5] + i[0:2] in nineWordSet:
        correctWords.append(i)

Your previous code was slow because for each word you had to run through all the other words again. That's about 4,624,000,000 comparisons; that's what's meant by O(n^2). In the new code, for each word you only have to look in a handful of well-defined places, so you only look at 68,000 words. That's the benefit of hash tables, which can often give you O(n) performance on a dataset.
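
For reference, here is the rough arithmetic behind those figures as a quick sketch (not part of the original answer):

n = 68000      # number of nine-letter words

print(n * n)   # 4,624,000,000 comparisons for the nested loops
print(n * 5)   # 340,000 set look-ups: at most five rearrangements per word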

Richard