I need to parse a preliminary GenBank Flatfile. The sequence hasn't been published yet, so I can't look it up by accession and download a FASTA file. I'm new to Bioinformatics, so could someone show me where I could find a BioPerl or BioPython script to do this myself? Thanks!
-
3Googling for "biopython parse fasta" gives this http://www.biopython.org/wiki/SeqIO as 1st match. 2nd match is a tutorial for parsing fasta. is this what you are looking for? – Fredrik Pihl Jun 13 '11 at 22:01
-
And of course googling for "bioperl parse fasta" also gives proper results, like this FAQ: "I want to parse FASTA or NCBI -m7 (XML) format, how do I do this?" at http://www.bioperl.org/wiki/FAQ#I_want_to_parse_FASTA_or_NCBI_-m7_.28XML.29_format.2C_how_do_I_do_this.3F – mirod Jun 14 '11 at 07:16
2 Answers
I have the Biopython solution for you here. I will firstly assume your genbank file relates to a genome sequence, then I will provide a different solution assuming it was instead a gene sequence. Indeed it would have been helpful to have known which of these you are dealing with.
Genome Sequence Parsing:
Parse in your custom genbank flatfile from file by:
from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")
If you just want the raw sequence then:
rawSequence = record.seq.tostring()
Now perhaps you need a name for this sequence, to give the sequence a ">header" before making the .fasta. Let's see what names came with the genbank .gb file:
nameSequence = record.features[0].qualifiers
This should return a dictionary with various synonyms of that whole sequence as annotated by author of that genbank file
Gene Sequence Parsing:
Parse in your custom genbank flatfile from file by:
from Bio import SeqIO
record = SeqIO.read("yourGenbankFileDirectory/yourGenbankFile.gb","genbank")
To get a list of raw sequences for the gene/list of all genes then:
rawSequenceList = [gene.extract(record.seq.tostring()) for gene in record.features]
To get a list of names for each gene sequence (more precisely a dictionary of synonyms for each gene)
nameSequenceList = [gene.qualifiers for gene in record.features]

- 2,039
- 2
- 21
- 51
You need the Bio::SeqIO module to read or write out bioinformatics data. The SeqIO HOWTO should tell you everything you need to know, but here's a small read-a-GenBank-file script in Perl to get you started!

- 1,888
- 1
- 18
- 23
-
1Thanks, this works for normal genbank files, but mine's actually a preliminary submission, so I'll have to parse it myself. – user81997 Jun 14 '11 at 13:24