Modify a FASTA file with a function using Biopython

Question

I need to run this code on thousands of FASTA files, so I'm wondering if there is a function to accelerate the process.

from Bio import SeqIO 

new= open("new.fasta", "w")   
for rec in SeqIO.parse("old.fasta","fasta"):
    print(rec.id)
    print(rec.seq.reverse_complement())
    new.write(">rc_"+rec.id+"\n")
    new.write(str(rec.seq.reverse_complement())+"\n")
new.close()

What needs to be accelerated? Looping over the thousands of files? — MattDMo, Dec 03 '22 at 22:14
Combining the two new.write calls into one should save something? — pippo1980, Dec 04 '22 at 11:21
Not an expert on buffers, but you could try a bigger buffer if you have enough memory; see something along the lines of https://stackoverflow.com/questions/3167494/how-often-does-python-flush-to-a-file — pippo1980, Dec 04 '22 at 11:25
You could try to parallelize your script with the multiprocessing module; see the accepted answer here: https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop — pippo1980, Dec 04 '22 at 11:30
See here: Speed up iterating through a file https://stackoverflow.com/questions/74089404/speed-up-iterating-through-a-file/74089426#comment130828900_74089426 — pippo1980, Dec 04 '22 at 12:22
@MattDMo yes, I would like to write a function with def to replace the bare for loop — jonny jeep, Dec 04 '22 at 13:48
Check here: What is the fastest way to get the reverse complement of a DNA sequence in python? https://bioinformatics.stackexchange.com/questions/3583/what-is-the-fastest-way-to-get-the-reverse-complement-of-a-dna-sequence-in-pytho — pippo1980, Dec 04 '22 at 19:20

score 1 · Accepted Answer · answered Dec 04 '22 at 19:46

I rewrote your code as a function that can be called with each filename you have, possibly collected into a list using os.listdir().

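from Bio import SeqIO

def parse_file(filename):
    # Output goes to "rc_<filename>". A bare file name is expected here:
    # a directory part would end up after the rc_ prefix.
    new_name = f"rc_{filename}"
    with open(new_name, "w") as new:
        for rec in SeqIO.parse(filename, "fasta"):
            # := stores each value while printing it, so it is computed only once
            print(rec_id:=rec.id)
            print(rev_comp:=str(rec.seq.reverse_complement()))
            new.write(f">rc_{rec_id}\n{rev_comp}\n")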

I used f-strings to build both the new filename and the strings written to that file. I also used the "walrus operator" (:=) to assign the values of rec.id and str(rec.seq.reverse_complement()) to temporary variables, so those operations don't have to run again when the data is written. The main saving is that the reverse complement is computed once per record instead of twice, which adds up over thousands of files. However, use of := means the code will only run under Python 3.8 and later.
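
To run the function on every file, something like the following might work. It is only a sketch: the helper names list_fasta_inputs and run_on_all are made up for illustration, and it assumes that the inputs end in .fasta, that they sit in the directory the script is run from, and that files already starting with rc_ are earlier outputs to be skipped.

```python
import os


def list_fasta_inputs(directory=".", suffix=".fasta", prefix="rc_"):
    """Return the sorted FASTA file names in `directory`.

    Names that already start with `prefix` are skipped so that a second
    run does not process its own output (an assumption about naming).
    """
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(suffix) and not name.startswith(prefix)
    )


def run_on_all(process_one, directory="."):
    """Call process_one(name) for each input file and return the names used."""
    names = list_fasta_inputs(directory)
    for name in names:
        process_one(name)
    return names
```

With parse_file defined as above, run_on_all(parse_file) would process every .fasta file in the current directory. os.listdir() returns bare file names, which is what parse_file expects when it builds the rc_ output name.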
