I have a file that contains millions of sequences, one per line. I want to extract every 5-mer from each sequence in the file.
My file looks like this:
CGATGCATAGGAA
GCAGGAGTGATCC
My code is:
with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
With this code I should not get any 4-mers, but I do, even though I have an if statement
that checks for a length of 5. Could anyone help me with this? Thanks.
My output is:
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC
But the ideal output should contain only the k-mers of length exactly 5 (for each line separately):
CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
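
A likely cause, assuming each line of test.txt ends with a newline character: iterating over a file yields lines that still include the trailing "\n", so the last slice of each line is "GGAA\n", which has length 5 and passes the check. The sketch below shows one possible fix: strip the line first and bound the loop so no slice runs past the end. The helper name kmers and its parameter k are hypothetical, introduced only for illustration.

```python
def kmers(sequence, k=5):
    """Yield every length-k substring of sequence (hypothetical helper)."""
    # Stop at len(sequence) - k so that no slice runs past the end;
    # a sequence shorter than k yields nothing.
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# A line read from a file keeps its trailing newline, so the final
# slice is "GGAA\n": four bases plus "\n", five characters in total.
line = "CGATGCATAGGAA\n"
last_slice = line[9:14]

# Assumption: the only unwanted characters are surrounding whitespace.
sequence = line.strip()
result = list(kmers(sequence))
```

In the original loop, the same idea would be to call line.strip() before slicing, for example `for kmer in kmers(line.strip()): print(kmer)` inside `for line in file:`.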