I need help with my script - I have spent a lot of time but have not been able to adjust the script in a way to give the expected output.
I have a file the below format:
>seq1
ACTGTAG #---> 6mers of this line are ACTGTA & CTGTAG
>seq2
ATAGGAG #---> 6mers of this line are ATAGGA & TAGGAG
>seq3
TCCTATT #---> 6mers of this line are TCCTAT & CCTATT
with my script, I want to save lines (which pass filters in the script) in an iterative way. I want to start by 'saving' the first line, and read all reverse complement 6-mers of that oligo into a list. In this example file I have two 6mers [m1,m2] from line1 and their respective complements are should be added to list2 [c1,c2] (script does this reverse complement conversion correctly)
and when iteration over the seq2 - the script first should look if any of the [c1,c2] are in seq2 if yes then the seq2 should be discarded else should be saved and line2's respective 6mer complements [c3,c4] should be added to the list2 and the updated list2 should be [c1,c2,c3,c4]
The problem is that after the third iteration, list1 will contain [m1,m2,m3], and list2 will contain [c1,c1,c2,c1,c2,c3] and this makes a big problem when I run this script on a 9000 lines file. It would be great if you could help me with this.
My code is:
from Bio import SeqIO
from Bio.Seq import Seq
with open('test.fa', 'r') as file:
list1 = []
list2 = []
for record in SeqIO.parse(file, 'fasta'):
for i in range(len(record.seq)):
kmer = str(record.seq[i:i + 6]) #gets 6 alphabet blocks (6mers)
if len(kmer) == 6:
list1.append(kmer)
#print(record.seq)
#print(list1)
# reverse complement of 6mers and appends to the list2
for kmers in list:
C_kmer = Seq(kmers).complement()
list2.append(C_kmer[::-1])
# print(list2)
cnt = 0
if any(items in record.seq for items in list2):
cnt += 1
if cnt == 0:
print('>' + record.id)
print(record.seq)
print(list2)
the desired output should be: (because these lines do not have any reverse complement 6mers inside [c1,c2,c3,c4] but seq3 does)
>seq1
ACTGTAG
>seq2
ATAGGAG
and the list2 should be:
[Seq('TACAGT'), Seq('CTACAG'),Seq('TCCTAT'), Seq('CTCCTA'), Seq('ATAGGA'), Seq('AATAGG')]
but currently, my list2 is as below which is wrong - it has a repetition of some reverse complement 6mers as mentioned previously.
[Seq('TACAGT'), Seq('CTACAG'), Seq('TACAGT'), Seq('CTACAG'), Seq('TCCTAT'), Seq('CTCCTA'), Seq('TACAGT'), Seq('CTACAG'), Seq('TCCTAT'), Seq('CTCCTA'), Seq('ATAGGA'), Seq('AATAGG')]