My input is two strings of the same length and a number which represents the length of the common words I need to find in both strings. I wrote a very straightforward code to do so, and it works, but it is super super slow, considering the fact that each string is ~200K letters.
This is my code:
for i in range(len(X)):
for j in range(len(Y)):
if(X[i] == Y[j]):
for k in range (kmer):
if (X[i+k] == Y[j+k]):
count +=1
else:
count=0
if(count == int(kmer)):
loc=(i,j)
pos.append(loc)
count=0
if(Xcmp[i] == Y[j]):
for k in range (kmer):
if (Xcmp[i+k] == Y[j+k]):
count +=1
else:
count=0
if(count == int(kmer)):
loc=(i,j)
pos.append(loc)
count=0
return pos
Where the first sequence is X and the second is Y and kmer is the length of the common words. (and when I say word, I just mean characters..)
I was able to create a X by kmer matrix (rather than the huge X by Y) but that's still very slow.
I also thought about using a trie, but thought that maybe it will take too long to populate it?
At the end I only need the positions of those common subsequences.
any ideas on how to improve my algorithm? Thanks!! :)