0

I need help with my script - I have spent a lot of time but have not been able to adjust the script in a way to give the expected output.

I have a file the below format:

>seq1
ACTGTAG #---> 6mers of this line are ACTGTA & CTGTAG
>seq2
ATAGGAG #---> 6mers of this line are ATAGGA & TAGGAG
>seq3
TCCTATT #---> 6mers of this line are TCCTAT & CCTATT

with my script, I want to save lines (which pass filters in the script) in an iterative way. I want to start by 'saving' the first line, and read all reverse complement 6-mers of that oligo into a list. In this example file I have two 6mers [m1,m2] from line1 and their respective complements are should be added to list2 [c1,c2] (script does this reverse complement conversion correctly) and when iteration over the seq2 - the script first should look if any of the [c1,c2] are in seq2 if yes then the seq2 should be discarded else should be saved and line2's respective 6mer complements [c3,c4] should be added to the list2 and the updated list2 should be [c1,c2,c3,c4]

The problem is that after the third iteration, list1 will contain [m1,m2,m3], and list2 will contain [c1,c1,c2,c1,c2,c3] and this makes a big problem when I run this script on a 9000 lines file. It would be great if you could help me with this.

My code is:

from Bio import SeqIO
from Bio.Seq import Seq

with open('test.fa', 'r') as file:
    list1 = []
    list2 = []
    for record in SeqIO.parse(file, 'fasta'):
        for i in range(len(record.seq)):
            kmer = str(record.seq[i:i + 6]) #gets 6 alphabet blocks (6mers)
            if len(kmer) == 6:
                list1.append(kmer)
        #print(record.seq)
        #print(list1)
         
        # reverse complement of 6mers and appends to the list2
        for kmers in list:
            C_kmer = Seq(kmers).complement()
            list2.append(C_kmer[::-1])
            # print(list2)

        cnt = 0
        if any(items in record.seq for items in list2):
            cnt += 1

        if cnt == 0:
            print('>' + record.id)
            print(record.seq)

print(list2)

the desired output should be: (because these lines do not have any reverse complement 6mers inside [c1,c2,c3,c4] but seq3 does)

>seq1
ACTGTAG 
>seq2
ATAGGAG

and the list2 should be:

[Seq('TACAGT'), Seq('CTACAG'),Seq('TCCTAT'), Seq('CTCCTA'), Seq('ATAGGA'), Seq('AATAGG')]

but currently, my list2 is as below which is wrong - it has a repetition of some reverse complement 6mers as mentioned previously.

[Seq('TACAGT'), Seq('CTACAG'), Seq('TACAGT'), Seq('CTACAG'), Seq('TCCTAT'), Seq('CTCCTA'), Seq('TACAGT'), Seq('CTACAG'), Seq('TCCTAT'), Seq('CTCCTA'), Seq('ATAGGA'), Seq('AATAGG')]
Apex
  • 1,055
  • 4
  • 22
  • 1
    Move `list1 = []` at the beginning of the `for record in SeqIO.parse(` loop. – Chris Charley Mar 17 '21 at 23:05
  • from your examples is alway not clear to us (see https://stackoverflow.com/questions/66760706/how-to-make-dictionary-using-substrings-in-a-loop-python/66776597#66776597 too) if your sequences are the one you show as your input or not. why you have tu use:< for i in range(len(record.seq)): kmer = str(record.seq[i:i + 6]) #gets 6 alphabet blocks (6mers)> on a seq1 that is just a 7 letter string ?? its kind of difficult to grasp your intention – pippo1980 Mar 24 '21 at 08:01

1 Answers1

0

I think this should work - this is my own answer, please give your inputs

with open('file.fasta', 'r') as file:
    list = []
    for record in SeqIO.parse(file, 'fasta'):

        cnt = 0
        if any(items in record.seq for items in list):
            cnt += 1
            pass
        else:
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 6])
                if len(kmer) == 6:
                    C_kmer = Seq(kmer).complement()
                    list.append(C_kmer[::-1])

        if cnt == 0:
            print('>' + record.id)
            print(record.seq)
Apex
  • 1,055
  • 4
  • 22