I'm trying to write a program that shifts through a DNA sequence one position at a time, reading a substring of a defined length at each step, but I can't understand the output I'm getting from the loop. It seems to frameshift fine for the first four iterations of the loop, then seems to revert to old sequences. I've tried really hard to understand the behaviour, but I'm too new to programming to solve this. Any help is much appreciated.
Here is my code:
seq = "ACTGCATTTTGCATTTT"
search = "TGCATTTTG"
import regex as re
def kmers(text,n):
    for a in text:
        b = text[text.index(a):text.index(a)+n]
        c = len(re.findall(b, text, overlapped=True))
        print ("the count for " + b + " is " + str(c))
(kmers(seq,3))
And my output:
the count for ACT is 1
the count for CTG is 1
the count for TGC is 2
the count for GCA is 2
#I expected 'CAT' next, from here on I don't understand the behaviour
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for GCA is 2
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
Obviously I eventually want to remove duplicates, etc., but being stuck on why my for loop isn't working the way I expected has stopped me in my tracks.
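For comparison, here is a minimal sketch of the frame-shifting described above. It assumes the window should advance by position (an integer index) rather than by looking up each character with `text.index(a)`, which returns the position of the first occurrence of that character. The function name `kmers_by_position` is hypothetical, and the overlapping count is done in plain Python instead of the `regex` module, so this is an illustration rather than a drop-in replacement.

```python
def kmers_by_position(text, n):
    """Return a list of (kmer, overlapping_count) pairs, one per window start."""
    results = []
    last_start = len(text) - n  # last index where a full window of length n fits
    for i in range(last_start + 1):
        kmer = text[i:i + n]  # slice by position, not by character lookup
        # Count every (possibly overlapping) window equal to this kmer.
        count = sum(1 for j in range(last_start + 1) if text[j:j + n] == kmer)
        results.append((kmer, count))
    return results


seq = "ACTGCATTTTGCATTTT"
for kmer, count in kmers_by_position(seq, 3):
    print("the count for " + kmer + " is " + str(count))
```

With this version the fifth line printed is for `CAT`, and the loop stops once a full window of length 3 no longer fits, so it yields 15 windows for this 17-character sequence.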
Thanks