Convert GenBank Flatfiles to FASTA

Question

I need to parse a preliminary GenBank Flatfile. The sequence hasn't been published yet, so I can't look it up by accession and download a FASTA file. I'm new to Bioinformatics, so could someone show me where I could find a BioPerl or BioPython script to do this myself? Thanks!

Googling for "biopython parse fasta" gives this http://www.biopython.org/wiki/SeqIO as 1st match. 2nd match is a tutorial for parsing fasta. is this what you are looking for? — Fredrik Pihl, Jun 13 '11 at 22:01
And of course googling for "bioperl parse fasta" also gives proper results, like this FAQ: "I want to parse FASTA or NCBI -m7 (XML) format, how do I do this?" at http://www.bioperl.org/wiki/FAQ#I_want_to_parse_FASTA_or_NCBI_-m7_.28XML.29_format.2C_how_do_I_do_this.3F — mirod, Jun 14 '11 at 07:16

score 0 · Answer 1 · answered Nov 23 '13 at 16:44

I have the Biopython solution for you here. I will firstly assume your genbank file relates to a genome sequence, then I will provide a different solution assuming it was instead a gene sequence. Indeed it would have been helpful to have known which of these you are dealing with.

Genome Sequence Parsing:

Parse in your custom genbank flatfile from file by:

from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")

If you just want the raw sequence then:

rawSequence = record.seq.tostring()

Now perhaps you need a name for this sequence, to give the sequence a ">header" before making the .fasta. Let's see what names came with the genbank .gb file:

nameSequence = record.features[0].qualifiers

This should return a dictionary with various synonyms of that whole sequence as annotated by author of that genbank file

Gene Sequence Parsing:

Parse in your custom genbank flatfile from file by:

from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")

To get a list of raw sequences for the gene/list of all genes then:

rawSequenceList = [gene.extract(record.seq.tostring()) for gene in record.features]

To get a list of names for each gene sequence (more precisely a dictionary of synonyms for each gene)

nameSequenceList = [gene.qualifiers for gene in record.features]

score 0 · Accepted Answer · answered Jun 14 '11 at 10:38

0

You need the Bio::SeqIO module to read or write out bioinformatics data. The SeqIO HOWTO should tell you everything you need to know, but here's a small read-a-GenBank-file script in Perl to get you started!

answered Jun 14 '11 at 10:38

Gaurav

1,888
1
18
23

1

Thanks, this works for normal genbank files, but mine's actually a preliminary submission, so I'll have to parse it myself. – user81997 Jun 14 '11 at 13:24

Convert GenBank Flatfiles to FASTA

2 Answers2