Incorrect parse with BCBio's GFF parser

Question

I'm experimenting with BCBio's GFF parser, in the hope I can use it for my tool. I've taken a test .gbk file from NCBI's RefSeq database, and used it to parse into a .gff file.

Code I used (from http://biopython.org/wiki/GFF_Parsing):

#!/usr/bin/python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "/var/www/localhost/NC_009925.gbk"
    out_file = "/var/www/localhost/output/your_file.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

Here is part of the outcome:

##gff-version 3
##sequence-region NC_009925.1 1 6503724
NC_009925.1 annotation  remark  1   6503724 .   .   .   accessions=NC_009925;comment=PROVISIONAL REFSEQ: This record has not yet been subject to final%0ANCBI review. The reference sequence was derived from CP000828.%0ASource bacteria from Marine Biotechnology Institute Culture%0ACollection%2C Marine Biotechnology Institute%2C 3-75-1 Heita%2C Kamaishi%2C%0AIwate 026-0001%2C Japan.%0ACOMPLETENESS: full length.;data_file_division=CON;date=10-JUN-2013;gi=158333233;keywords=;organism=Acaryochloris marina MBIC11017;references=location: %5B0:6503724%5D%0Aauthors: Swingley%2CW.D.%2C Chen%2CM.%2C Cheung%2CP.C.%2C Conrad%2CA.L.%2C Dejesa%2CL.C.%2C Hao%2CJ.%2C Honchak%2CB.M.%2C Karbach%2CL.E.%2C Kurdoglu%2CA.%2C Lahiri%2CS.%2C Mastrian%2CS.D.%2C Miyashita%2CH.%2C Page%2CL.%2C Ramakrishna%2CP.%2C Satoh%2CS.%2C Sattley%2CW.M.%2C Shimada%2CY.%2C Taylor%2CH.L.%2C Tomo%2CT.%2C Tsuchiya%2CT.%2C Wang%2CZ.T.%2C Raymond%2CJ.%2C Mimuro%2CM.%2C Blankenship%2CR.E. and Touchman%2CJ.W.%0Atitle: Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina%0Ajournal: Proc. Natl. Acad. Sci. U.S.A. 105 %286%29%2C 2005-2010 %282008%29%0Amedline id: %0Apubmed id: 18252824%0Acomment:,location: %5B0:6503724%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-OCT-2007%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:6503724%5D%0Aauthors: Touchman%2CJ.W.%0Atitle: Direct Submission%0Ajournal: Submitted %2827-AUG-2007%29 Pharmaceutical Genomics Division%2C Translational Genomics Research Institute%2C 13208 E Shea Blvd%2C Scottsdale%2C AZ 85004%2C USA%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;source=Acaryochloris marina MBIC11017;taxonomy=Bacteria,Cyanobacteria,Oscillatoriophycideae,Chroococcales,Acaryochloris
NC_009925.1    feature  source  1   6503724 .   +   .   db_xref=taxon:329726;mol_type=genomic DNA;note=type strain of Acaryochloris marina;organism=Acaryochloris marina MBIC11017;strain=MBIC11017
NC_009925.1    feature  gene    931 1581    .   -   .   db_xref=GeneID:5685235;locus_tag=AM1_0001;note=conserved hypothetical protein;pseudo=
NC_009925.1    feature  gene    1627    2319    .   -   .   db_xref=GeneID:5678840;locus_tag=AM1_0003

The problem lies in the third and the fourth line: it takes the complete header info from the .gbk and puts it in as a line, while it should skip it. The last two lines are correct (and so is the rest of the output file). I've tried using several different .gbk files, all yield the same outcome.

For reference, here's the beginning of the .gbk file:

LOCUS       NC_009925            6503724 bp    DNA     circular CON 10-JUN-2013
DEFINITION  Acaryochloris marina MBIC11017 chromosome, complete genome.
ACCESSION   NC_009925
VERSION     NC_009925.1  GI:158333233
DBLINK      Project: 58167
            BioProject: PRJNA58167
KEYWORDS    .
SOURCE      Acaryochloris marina MBIC11017
  ORGANISM  Acaryochloris marina MBIC11017
            Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales;
            Acaryochloris.
REFERENCE   1  (bases 1 to 6503724)
  AUTHORS   Swingley,W.D., Chen,M., Cheung,P.C., Conrad,A.L., Dejesa,L.C.,
            Hao,J., Honchak,B.M., Karbach,L.E., Kurdoglu,A., Lahiri,S.,
            Mastrian,S.D., Miyashita,H., Page,L., Ramakrishna,P., Satoh,S.,
            Sattley,W.M., Shimada,Y., Taylor,H.L., Tomo,T., Tsuchiya,T.,
            Wang,Z.T., Raymond,J., Mimuro,M., Blankenship,R.E. and
            Touchman,J.W.
  TITLE     Niche adaptation and genome expansion in the chlorophyll
            d-producing cyanobacterium Acaryochloris marina
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 105 (6), 2005-2010 (2008)
   PUBMED   18252824
REFERENCE   2  (bases 1 to 6503724)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (17-OCT-2007) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 6503724)
  AUTHORS   Touchman,J.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-AUG-2007) Pharmaceutical Genomics Division,
            Translational Genomics Research Institute, 13208 E Shea Blvd,
            Scottsdale, AZ 85004, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from CP000828.
            Source bacteria from Marine Biotechnology Institute Culture
            Collection, Marine Biotechnology Institute, 3-75-1 Heita, Kamaishi,
            Iwate 026-0001, Japan.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..6503724
                     /organism="Acaryochloris marina MBIC11017"
                     /mol_type="genomic DNA"
                     /strain="MBIC11017"
                     /db_xref="taxon:329726"
                     /note="type strain of Acaryochloris marina"
     gene            complement(931..1581)
                     /locus_tag="AM1_0001"
                     /note="conserved hypothetical protein"
                     /pseudo
                     /db_xref="GeneID:5685235"
     gene            complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /db_xref="GeneID:5678840"
     CDS             complement(1627..2319)
                     /locus_tag="AM1_0003"
                     /codon_start=1
                     /transl_table=11
                     /product="NUDIX hydrolase"
                         /protein_id="YP_001514406.1"
                     /db_xref="GI:158333234"
                     /db_xref="GeneID:5678840"
                     /translation="MPYTYDYPRPGLTVDCVVFGLDEQIDLKVLLIQRQIPPFQHQWA
                 LPGGFVQMDESLEDAARRELREETGVQGIFLEQLYTFGDLGRDPRDRIISVAYYALIN
                 LIEYPLQASTDAEDAAWYSIENLPSLAFDHAQILKQAIRRLQGKVRYEPIGFELLPQK
                 FTLTQIQQLYETVLGHPLDKRNFRKKLLKMDLLIPLDEQQTGVAHRAARLYQFDQSKY
                 ELLKQQGFNFEV"

Does anyone know how I can solve this?

I've used the following line to filter out the first two wrong lines:

if "\tannotation\t" in line or "feature\tsource" in line:

This seems to work on several test .gbk's. But I'm still curious as to why it parses those in the first place?

score 1 · Answer 1 · answered Dec 02 '13 at 01:56

The answer is in the wiki page you linked (http://biopython.org/wiki/GFF_Parsing#Writing_GFF3). "The GFF3Writer takes an iterator of SeqRecord objects, and writes each SeqFeature as a GFF3 line". The SeqRecord objects parsed from the .gbk file contain this annotation, hence it is written by the writer. In the implementation (https://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFOutput.py) you can see where it is done:

self._write_annotations(rec.annotations, rec.id, len(rec.seq), out_handle)

There you can also see why the source feature is passed. It is just a feature like the others (gene, CDS) and not treated separately.

I do not know why there is no option or parameter (at least I haven't found any) to tell the writer to skip the annotations. I don't know of any parameter to skip the annotations when reading the SeqRecords with SeqIO.parse() either.

To solve your problem I access the parsed SeqRecords separately, delete the annotations and the source feature. One downside of this approach is the additional memory needed (as well as performance loss) because I'm creating a List from the initial generator. In the end I just parse the List to GFF. I don't know if this approach is so much better than yours.

#!/usr/bin/env python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "input.gbk"
    out_file = "output.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    records = []
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i in range(0,len(record.features)):
            # if found, delete it and stop (only expect one source)
            if(record.features[i].type == "source"):
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

Incorrect parse with BCBio's GFF parser

1 Answers1