
I am new to Biopython and I have a performance issue when parsing GenBank files.

I have to parse a lot of GenBank files, for which I have the accession numbers. After parsing, I only want to examine the taxonomy and the organelle of each record. Right now, I have this code:

from Bio import SeqIO
from Bio import Entrez
gb_acc1 = Entrez.efetch(db='nucleotide', id=access1, rettype='gb', retmode='text')  # access1 contains the accession number
rec = SeqIO.read(gb_acc1, 'genbank')
cache[access1] = rec  # cache is a dictionary holding the records already downloaded
feat = cache[access1].features[0]  # the 'source' feature
if 'organelle' in feat.qualifiers:  # and the code goes on

To look up the taxonomy I have:

gi_h = Entrez.efetch(db='nucleotide', id=access, rettype='gb', retmode='text')
gi_rec = SeqIO.read(gi_h, 'genbank')
cache[access] = gi_rec
if cache[access].annotations['taxonomy'][1] == 'Fungi':
    fungi += 1  # and the code goes on

This (the whole script) works fine. My problem is that I am downloading the whole GenBank file (which is sometimes huge) just to look at these two features: the organelle and the taxonomy. If I could download only this part of the file, my script would be much faster, but I have not figured out whether this is possible.

Does anyone know if this can be done, and if so, how? Thanks a lot in advance.

1 Answer


You can use seq_start and seq_stop to truncate your sequence and then parse it as before, e.g.

gb_acc1 = Entrez.efetch(db='nuccore', id=access1, rettype='gb', retmode='text', seq_start=1, seq_stop=1)

Perhaps you don't even need to store the whole GenBank file but only a dictionary with the ID as key and taxonomy and organelle as values?
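A minimal sketch of that idea, assuming network access and a set-once `Entrez.email` (the helper names `summarize` and `fetch_summary`, and the placeholder email, are hypothetical, not from the thread): the fetch is truncated to a 1 bp slice with `seq_start`/`seq_stop`, and only the two fields of interest go into the cache. The extraction step is a separate pure function so it can be applied to any already-parsed record.

```python
def summarize(annotations, qualifiers):
    """Keep only the two fields we care about (hypothetical helper)."""
    return {
        'taxonomy': annotations.get('taxonomy', []),
        'organelle': qualifiers.get('organelle', [None])[0],
    }

def fetch_summary(accession, cache, email='you@example.com'):
    """Fetch a 1 bp GenBank slice and cache its taxonomy/organelle.

    Biopython is imported lazily so summarize() above stays usable
    without it installed.
    """
    if accession in cache:
        return cache[accession]
    from Bio import Entrez, SeqIO
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(db='nuccore', id=accession, rettype='gb',
                           retmode='text', seq_start=1, seq_stop=1)
    rec = SeqIO.read(handle, 'genbank')
    handle.close()
    source = rec.features[0]  # the 'source' feature carries /organelle
    cache[accession] = summarize(rec.annotations, source.qualifiers)
    return cache[accession]
```

The Fungi check from the question then reads from the cached dictionary instead of a full SeqRecord.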

Maximilian Peters
  • Thanks!! Both suggestions were a nice idea and seem to speed up the script. However, it is still not quite the solution I am looking for. Check, for example, this GenBank file: http://www.ncbi.nlm.nih.gov/nuccore/CP015199 Here the problem is not the sequence but the huge number of features (CDS, gene), which I am still downloading and not looking at afterwards. But thanks!! – VictorBello Jul 28 '16 at 09:18
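For records whose feature table dominates the download, one alternative not raised in the thread (a hedged sketch, with hypothetical helper names) is to skip the GenBank file entirely for the taxonomy check: `Entrez.esummary` on nuccore returns a small document summary that includes the TaxId, and a second `Entrez.efetch` against the taxonomy database returns the full lineage. The organelle still needs the GenBank source feature, but the `seq_start`/`seq_stop` trick from the answer keeps that request small.

```python
def is_fungus(lineage):
    """Check a semicolon-separated NCBI lineage string for Fungi."""
    return 'Fungi' in [taxon.strip() for taxon in lineage.split(';')]

def lineage_for(accession, email='you@example.com'):
    """Resolve accession -> TaxId -> lineage with two small requests.

    Biopython is imported lazily so is_fungus() above works without it.
    """
    from Bio import Entrez
    Entrez.email = email  # NCBI asks for a contact address
    # 1) Tiny DocSum instead of the full GenBank record
    handle = Entrez.esummary(db='nuccore', id=accession)
    docsum = Entrez.read(handle)[0]
    handle.close()
    taxid = str(int(docsum['TaxId']))
    # 2) Lineage string from the taxonomy database
    handle = Entrez.efetch(db='taxonomy', id=taxid, retmode='xml')
    tax_rec = Entrez.read(handle)[0]
    handle.close()
    return tax_rec['Lineage']
```

Counting fungal records then becomes `fungi += is_fungus(lineage_for(access))` with no feature table downloaded at all.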