An efficient way to search similar words (with specified length) in two strings using python

Question

My input is two strings of the same length and a number that represents the length of the common words I need to find in both strings. I wrote some very straightforward code to do this, and it works, but it is extremely slow, given that each string is ~200K letters.

This is my code:

for i in range(len(X)):
    for j in range(len(Y)):
        if(X[i] == Y[j]):
            for k in range (kmer):                
                if (X[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
                if(count == int(kmer)):
                    loc=(i,j)
                    pos.append(loc)
                    count=0    

        if(Xcmp[i] == Y[j]):
            for k in range (kmer):                
                if (Xcmp[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
                if(count == int(kmer)):
                    loc=(i,j)
                    pos.append(loc)
                    count=0

return pos

Here X is the first sequence, Y is the second, and kmer is the length of the common words. (When I say "word", I just mean a run of characters.)

I was able to create an X by kmer matrix (rather than the huge X by Y one), but that's still very slow.

I also thought about using a trie, but thought that maybe it would take too long to populate?

In the end I only need the positions of those common subsequences.

Any ideas on how to improve my algorithm? Thanks!! :)

A trie sounds like a good idea. If I understand correctly, you can limit the depth of the trie to `kmer` — John La Rooy, Jan 16 '14 at 02:47
Thanks! any hints on how to start implementing a trie? specifically with a limited depth? — FairyDuster, Jan 16 '14 at 02:51
Simply insert `X[0:kmer]`, `X[1:kmer+1]`, ... then the depth will never exceed `kmer` — John La Rooy, Jan 16 '14 at 02:54
0,1,.. up to the length of the string? wouldn't that still be inefficient? (not sure if I understand how it works..) — FairyDuster, Jan 16 '14 at 03:01
If I understand your problem, you could easily [adapt the code in this answer](http://stackoverflow.com/questions/20267564/find-maximum-length-of-all-n-word-length-substrings-shared-by-two-strings/20917808#20917808). — Tim Peters, Jan 16 '14 at 03:11
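The depth-limited trie suggested in the comments above could be sketched roughly as follows. This is only an illustration, not code from the thread: `build_kmer_trie`, `lookup`, and the `"positions"` key are made-up names, and the sketch assumes the sequences are plain Python strings.

```python
def build_kmer_trie(seq, kmer):
    """Nested-dict trie of every length-kmer window of seq.

    The node reached after kmer characters stores the start indices
    of that window under the (hypothetical) key "positions".
    """
    root = {}
    for i in range(len(seq) - kmer + 1):
        node = root
        for ch in seq[i:i + kmer]:
            node = node.setdefault(ch, {})
        # "positions" is longer than one character, so it cannot
        # collide with a single-character edge label.
        node.setdefault("positions", []).append(i)
    return root


def lookup(trie, word):
    """Return the start indices stored for word, or [] if it is absent."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("positions", [])
```

Because only windows of length `kmer` are inserted, the trie never grows deeper than `kmer`. Building it takes about `len(X) * kmer` character steps, so populating it is not the bottleneck, although in pure Python a dict keyed by the slices themselves is likely to be simpler and faster.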

John La Rooy · Accepted Answer · 2014-01-16T03:32:15.133

Create a set of words like this:

words = {X[i:i+kmer] for i in range(len(X)-kmer+1)}
for i in range(len(Y)-kmer+1):
    if Y[i:i+kmer] in words:
        print(Y[i:i+kmer])

This is fairly efficient as long as kmer isn't so large that you'd run out of memory for the set. I assume it isn't since you were creating a matrix that size already.

For the positions, create a dict instead of a set, as Tim suggests:

from collections import defaultdict
wordmap = defaultdict(list)
for i in range(len(X)-kmer+1):
    wordmap[X[i:i+kmer]].append(i)

for i in range(len(Y)-kmer+1):
    word = Y[i:i+kmer]
    if word in wordmap:
        print(word, wordmap[word], i)
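
To get the result in the same shape as the `pos` list in the question, the dict approach can be wrapped in a function that returns `(i, j)` pairs instead of printing. This is a minimal sketch; the function name `common_kmer_positions` is an assumption, not something from the question.

```python
from collections import defaultdict


def common_kmer_positions(X, Y, kmer):
    """Return (i, j) pairs where X[i:i+kmer] == Y[j:j+kmer]."""
    kmer = int(kmer)

    # Map every length-kmer word of X to the indices where it starts.
    wordmap = defaultdict(list)
    for i in range(len(X) - kmer + 1):
        wordmap[X[i:i + kmer]].append(i)

    pos = []
    for j in range(len(Y) - kmer + 1):
        # .get avoids inserting an empty list into the defaultdict on a miss.
        for i in wordmap.get(Y[j:j + kmer], ()):
            pos.append((i, j))
    return pos
```

Assuming `Xcmp` from the question is simply a second sequence to compare against `Y`, both passes could be combined as `common_kmer_positions(X, Y, kmer) + common_kmer_positions(Xcmp, Y, kmer)`. Note that the pairs come out ordered by position in `Y`, not by position in `X` as in the original loops.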

Since the OP wants the *positions* of the matches, `words` probably needs to be a dict mapping a string to a list of the indices the string starts at. — Tim Peters, Jan 16 '14 at 03:14

score 0 · Answer 2 · answered Jan 16 '14 at 03:00

The triple nested for loop gives you a runtime of roughly n^2 · kmer (n^3 in the worst case, when kmer is close to n), because you're literally comparing every position of X against every position of Y. Consider using a rolling hash. It has a linear average runtime and a worst case of n^2. It's designed for finding substrings, which is more or less what you're doing. In this case you may be closer to n^2, but that's still much better than n^3.
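
A rolling hash for this problem might look something like the sketch below. The names `rolling_hashes` and `common_kmers_rolling`, and the `base` and `mod` values, are illustrative assumptions, and the code assumes the sequences are Python `str` objects. Matching hashes are confirmed with a direct comparison, since two different words can share a hash.

```python
from collections import defaultdict


def rolling_hashes(seq, kmer, base=257, mod=(1 << 61) - 1):
    """Yield (start, hash) for every length-kmer window of seq."""
    if kmer <= 0 or len(seq) < kmer:
        return
    high = pow(base, kmer - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in seq[:kmer]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for i in range(1, len(seq) - kmer + 1):
        h = (h - ord(seq[i - 1]) * high) % mod         # drop the outgoing character
        h = (h * base + ord(seq[i + kmer - 1])) % mod  # append the incoming character
        yield i, h


def common_kmers_rolling(X, Y, kmer):
    """Return (i, j) pairs where X[i:i+kmer] == Y[j:j+kmer]."""
    buckets = defaultdict(list)
    for i, h in rolling_hashes(X, kmer):
        buckets[h].append(i)

    pos = []
    for j, h in rolling_hashes(Y, kmer):
        for i in buckets.get(h, ()):
            # Equal hashes can be collisions, so confirm with a real comparison.
            if X[i:i + kmer] == Y[j:j + kmer]:
                pos.append((i, j))
    return pos
```

Each window's hash is derived from the previous one in constant time, so the hashing cost does not grow with `kmer`. In CPython, slicing and hashing short strings is already done in C, so for small `kmer` a dict keyed by the slices will probably be faster in practice. The rolling hash is more likely to pay off when `kmer` is large.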
