
When I create a file using GFF.write(), I get an extra line with "annotation" as the source and "remark" as the type, followed by a percent-encoded (URL-encoded) list of sequence regions:

##gff-version 3
##sequence-region NC_011594.1 1 16779
NC_011594.1 annotation  remark  1   16779   .   .   .   gff-version=3;sequence-region=%28%27NC_011594.1%27%2C 0%2C 16971%29,%28%27NC_042493.1%27%2C 0%2C 132544852%29, (continues on and on)
NC_011594.1 RefSeq  gene    1   1531    .   +   .   Dbxref=GeneID:7055888;ID=gene-COX1;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding

Any idea why it's there, what it's for, and how I could avoid it? I fear it might become a problem when the file is used with third-party software.

I imported only the bcbio-gff package, but I believe it's part of Biopython, link: https://biopython.org/wiki/GFF_Parsing

  • A reproducible example would be nice :) Until then it is hard to know what might be wrong. https://github.com/biocore-ntnu/pyranges can also read/write GFFs, dunno if it might solve your problem. – The Unfun Cat Apr 23 '20 at 12:46
  • What I did: extract the info from a GFF file (using GFF.parse() with a limit_info being some genes) , and then send it directly to GFF.write() to create a new file with only the selected genes – Felix Jaeger Apr 23 '20 at 13:48
  • This question might get more answers in the bioinformatics stackexchange site: bioinformatics.stackexchange.com – bli Apr 24 '20 at 11:57

1 Answer


To your first question, "Why is it there?"

  • I can only presume that, by default, the package author wanted to export as much information as possible.

To your next question, "How can I avoid it?"

  • Unfortunately, there is no off switch. For me, the solution was to remove all annotations from the exported sequences, i.e. to set each record's annotations attribute to an empty dictionary before calling GFF.write().

Example:

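from Bio import SeqIO
from BCBio import GFF

# Read a single GenBank record
g = SeqIO.read('NC_003888.3.gb','gb')

# Drop all record-level annotations so no "annotation remark" line is written
g.annotations = {}

# GFF.write() expects an iterable of records
with open('t2.gff', 'w') as f:
    GFF.write([g], f)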

Head of the output file, with no "annotation remark" line:

head t2.gff 
##gff-version 3
##sequence-region NC_003888.3 1 8667507
NC_003888.3 feature source  1   8667507 ... removed for clarity ....
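
The same idea can probably be applied to the workflow described in the comments (GFF.parse() with a limit_info filter, passed straight to GFF.write()). Below is a minimal sketch, not tested against every bcbio-gff version: the helper names clear_annotations and write_selected, the file names and the filter values are all made up for illustration, and it assumes that GFF.write() accepts any iterable of records, including a generator.

```python
def clear_annotations(records):
    """Yield each record with its annotations emptied.

    Works on any object that has an `annotations` attribute,
    such as a Biopython SeqRecord.
    """
    for rec in records:
        rec.annotations = {}
        yield rec


def write_selected(in_path, out_path, limit_info):
    """Parse a GFF file with a filter and write it back out
    with record-level annotations removed."""
    # Imported here so clear_annotations() can be used without bcbio-gff.
    from BCBio import GFF

    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        records = GFF.parse(in_handle, limit_info=limit_info)
        GFF.write(clear_annotations(records), out_handle)


# Hypothetical usage; the paths and the filter are placeholders:
# write_selected("in.gff", "selected.gff", {"gff_type": ["gene"]})
```

Note that this discards every record-level annotation, not just the sequence-region one, so anything you want to keep would have to be copied back into rec.annotations selectively.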
Marek Schwarz