0

I build a code to print strings if a substring exists at a particular section of the main string. I have a file as below and I create 5 alphabet substrings (5mers) from the seq11_rv.

>seq11_fw
TCAGATGTGTATAAGAGACAGTTATTAGCCGGTTCCAGGTATGCAGTATGAGAA
>seq11_rv
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTGCCGCGCTTCACTAGAAGTCATACTGC

Then I make a reverse-complement of these 5mers and append them to a list. Then I looked into the seq11_fw and if position [42:51] (GCAGTATGA in the seq11_fw) has any of items of a list then a confirmation should be printed.

To just make it easy to understand the last 5mer of the seq11_rv is ACTGC which its reverse-complement becomes GCAGT and if you check the seq11_fw[42:51] this GCAGT exists inside that location but I do not get any output.

Any help would be appreciated.

here is my code:

from Bio import SeqIO
from Bio.Seq import Seq

with open(file, 'r') as f:
    lst = []
    for record in SeqIO.parse(f, 'fasta'):
        if len(record.seq) == 81:
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 5])
                if len(kmer) == 5:
                    C_kmer = Seq(kmer).complement()
                    lst.append(C_kmer[::-1])


        cnt=0
        if len(record.seq) == 54 and any(str(items) in str(record.seq[42:51]) for items in lst):
            cnt +=1

        if cnt == 1:
            print(record.id)
            print(record.seq)

    print(lst)
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Apex
  • 1,055
  • 4
  • 22

1 Answers1

1

this one seems to work, problem is the way you set up your algorithm:

from Bio import SeqIO
from Bio.Seq import Seq

file ='test.faa'

with open(file, 'r') as f:
    lst = []
    for record in SeqIO.parse(f, 'fasta'):
        if len(record.seq) == 81:
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 5])
                if len(kmer) == 5:
                    C_kmer = Seq(kmer).complement()
                    lst.append(C_kmer[::-1])

with open(file, 'r') as f:    
    cnt=0
    for record in SeqIO.parse(f, 'fasta'):
        if len(record.seq) == 54 and any(str(items) in str(record.seq[41:52]) for items in lst):
            cnt +=1

        if cnt == 1:
            print(record.id)
            print(record.seq)
            cnt = 0

in this version you iterate over yor input file twice :

first times creating the 5mers list

second one checking for the [41:52] of record.seq of lenght = 54 in the list.

You need to reset the counter cnt to zero otherwise all the sequences will be printed

If you plan to have test file with more of one sequence, I would try to have one input file with the forward sequences and one input file with the reverse sequences, having the same order. Then I would make the check for every couple looping on the two files in parallel. Dont know how to do it , but here you have plenty of good examples :

How to iterate through two lists in parallel?

I believe you need to read the two records with SeqIO.parse into two list and then go with them like:

from Bio import SeqIO
from Bio.Seq import Seq

file1 ='test_fw.faa'
file2 ='test_rv.faa'

record1_lst = []
record2_lst = []

with open(file1, 'r') as f1:
    for record in SeqIO.parse(f1, 'fasta'):
        record1_lst.append(record)
with open(file2, 'r') as f2:   
    for record in SeqIO.parse(f2, 'fasta'):
        record2_lst.append(record)
    

for record_fw, record_rv in zip(record1_lst, record2_lst):
    print(record_fw.id, record_rv.id)

or work like this, remembering to close the files once the parser iterator is emptied:

from Bio import SeqIO
from Bio.Seq import Seq

file1 ='test_fw.faa'
file2 ='test_rv.faa'



f1 = open(file1, 'r')
f2 = open(file2, 'r')
    
record1 = SeqIO.parse(f1, 'fasta')
record2 = SeqIO.parse(f2, 'fasta')


for record_fw, record_rv in zip(record1, record2):
    print(record_fw.id, record_rv.id)
    
f1.close()
f2.close() 
pippo1980
  • 2,181
  • 3
  • 14
  • 30