Incomplete parsing of entire genbank file using python/biopython

Question

The main goal of my script is to convert a genbank file to a gtf file. My problem pertains to extracting CDS information (gene, position (e.g., CDS 2598105..2598404), codon_start, protein_id, db_xref) from all CDS entries. My script should open/parse a genbank file, extract information from each CDS entry, and write the information to another file. The script produces no errors, but only writes information from the first 1/2 of the genbank file before terminating. Here is my code...

import Bio
from Bio import GenBank
from Bio import SeqIO

fileList = ['data_files/e_coli_ref_BA000007.2.gb']
qualies = ['gene', 'protein_id', 'db_xref']

#######################################################DEFINITIONS################################################################
def strip_it(string_name):
    stripers = ['[', ']', '\'', '"']
    for s in stripers:
        string_name = string_name.replace(s, '')
        string_name = string_name.lstrip()
    return string_name

def strip_it_attributes(string_name):
    stripers = ['[', ']', '\'', '"', '{', '}',',']
    for s in stripers:
        string_name = string_name.replace(s, '') 
        string_name = string_name.lstrip() 
        string_name = string_name.replace(': ', '=')
        string_name = string_name.replace(' ', ';')
    return string_name
#---------------------------------------------------------------------------------------------------------------------------------

#######################################################################################################################
for f in fileList:
    nameOut = f.replace('gb', 'gtf')

    with open(f, 'r') as inputFile:
        with open(nameOut, 'w') as outputFile:
            record = next(SeqIO.parse(f, 'genbank'))
            seqid = record.id
            typeName = 'Gene'
            source = 'convert_gbToGFT.py'
            start_codon = 'NA'
            attribute = 'NA'    

            featureCount = 0
            for f in record.features:
                print(f.type)
                string = ''
                if f.type == 'CDS':
                    dic = {}
                    CDS = record.features[featureCount]

                    position = strip_it(str(CDS.location))
                    start = position.split(':')[0]
                    stop = position.split(':')[1].split('(')[0]
                    strand = position.split(':')[1].split('(')[1].replace(')', '')
                    score = '.'

                    for q in qualies:
                        if q in CDS.qualifiers:
                            if q not in dic:
                                dic[q] = ''
                            dic[q] = strip_it(str(CDS.qualifiers[q]))

                    attribute = strip_it_attributes(str(dic))

                    if 'codon_start' in CDS.qualifiers:
                        start_codon = str(int(str(CDS.qualifiers['codon_start'][0]))-1) #need string when finished so it can be added to variable 'string'

                    string = '\t'.join([seqid, source, typeName, start, stop, score, strand, start_codon, attribute])
                    if attribute.count(';') == 2:
                        outputFile.write(string + '\n')

                    featureCount+=1

#---------------------------------------------------------------------------------------------------------------------------------

The last line of the output file is:

BA000007.2  convert_gbToGFT.py   Gene  2598104  2598404  .  +  0  protein_i     d=BAB36052.1;db_xref=GI:13362097;gene=ECs2629

The location of gene ECs2629 appears on line 36094 in the genbank file, but the total number of lines in this file is 73498. I have re-downloaded the file multiple times to see if there was a downloading issue and I have visually inspected the file (I find no fault with it). I have also tried this script on another equally large genbank file and was met with identical issues.

Can anyone offer some suggestions as to why the entire genbank file is not parsed, how I could modify my code to remove this issue, or point me to another possible solution?

(you can see the format of a genbank file from here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), however, I am working with an E. coli genbank file (Escherichia coli O157:H7 str. Sakai DNA, complete genome) which can be found here: http://www.ncbi.nlm.nih.gov/nuccore/BA000007.2

I am using the following: Centos 6.7, Python 3.4.3 :: Anaconda 2.3.0 (64-bit), Biopython 1.66

[EDIT] @Gerrat suggestions worked for the file in question, but not for other files. Using http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3 with the suggested edit yields ~28 lines of output where my original code output 2084 lines (however, there should be 4332 lines of output).

Gerrat · Answer 1 · 2015-12-22T18:43:45.907

Change this line:

CDS = record.features[featureCount]

to:

CDS = f

You're skipping records by accessing them via the `featureCount' index (since there are probably 1/2 as many feature Counts as records).

EDIT: To elaborate on your comment:

Your original script is just wrong (w.r.t. the way you're using featureCount). My correction is necessary. If you have further issues, there is something else wrong. In this case, there appear to be 28 CDS records with an attribute count of 2. (I know nothing about gene sequencing, I'm just going by the variable names in the script). When you switch back to using featureCount, you're now looking at records where the "type" is not "CDS". It is "gene", or "repeat_region". You're checking the type of the record, f to see if it is CDS, but then using a completely different record, record.features[featureCount]. These don't refer to the same record (check the CDS.type of this record - it's no longer "CDS" in most cases).

Need to revisit this: I tried my script on a different file: http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3, and it outputs only 28 lines. A quick :%s/CDS/CDS/g revealed 4332 CDS regions -- which is significantly greater than 28. Reverting back to my original (w/ feautreCount) outputs 2084 lines. Any ideas? — cer, Dec 22 '15 at 15:07
@cer: Yup, see my Edit. (& most of these other records have an attribute count of 4 or 6, which you don't output to your file) — Gerrat, Dec 22 '15 at 18:44

score 0 · Answer 2 · answered Dec 17 '15 at 17:45

0

Out of curiosity, what happens if you iterate through each line by changing:

with open(f, 'r') as inputFile:

to

with open("file") as infile:
    for line in infile:
        do_something_with(line)

It would also be interesting to set some variable to zero before looping through the lines in the file and doing variable += 1 each time to see if the line number is what you expect

answered Dec 17 '15 at 17:45

Ryan

3,555
1
22
36

Replacing do_something_with(line) with print(line) will properly print each line of the file on the screen. – cer Dec 17 '15 at 17:53
I had also previously had a line that would augment the count by 1 if a CDS feature was encountered. This count was 1/2 what it should have been and corresponded to the CDS that contained the gene ECs2629. – cer Dec 17 '15 at 18:00

cer · Accepted Answer · 2015-12-23T19:26:31.093

Thank you @Gerrat for your comments. I re-worked the script and it works swimmingly.

import Bio 
from Bio import GenBank
from Bio import SeqIO

fileList = ['F1.gb', 'F2.gb']

for f in fileList:
      with open(f, 'rU') as handle:
         for record in SeqIO.parse(handle, 'genbank'):
            for feature in record.features:
               if feature.type=='CDS':
                  #[extract feature values here]
                  count+=1

   print('You parsed', count, 'CDS features')

Incomplete parsing of entire genbank file using python/biopython

3 Answers3