I am writing a program that takes a large XML file full of protein sequences from UniProt and analyses them. I have a working version already however that is for a smaller file. I now need to load in the full UniProt database efficiently and am unsure on how to go about it. I also want to extract specific attributes from the SeqRecord
objects that are created so there is only the data that I am interested in. I have the following:
import gzip
from Bio import SeqIO
file = "/Users/john/workspace/project-2/resources/uniprot_sprot_small.xml.gz"
def load_uniprot_records():
records = []
handle = gzip.open(file)
for record in SeqIO.parse(handle, "uniprot-xml"):
records.append(record)
print(record)
if __name__ == "__main__":
load_uniprot_records()
This presents some records that look like this:
ID: Q6GZX4
Name: 001R_FRG3G
Description: Putative transcription factor 001R
Database cross-references: DOI:10.1016/j.virol.2004.02.019, EMBL:AY548484, GO:GO:0046782, GeneID:2947773, InterPro:IPR007031, KEGG:vg:2947773, NCBI Taxonomy:654924, Pfam:PF04947, Proteomes:UP000008770, PubMed:15165820, RefSeq:YP_031579.1, Swiss-Prot:001R_FRG3G, Swiss-Prot:Q6GZX4, SwissPalm:Q6GZX4
Number of features: 2
/dataset=Swiss-Prot
/created=2011-06-28
/modified=2019-06-05
/version=37
/accessions=['Q6GZX4']
/recommendedName_fullName=['Putative transcription factor 001R']
/gene_name_ORF=['FV3-001R']
/taxonomy=['Viruses', 'Iridoviridae', 'Alphairidovirinae', 'Ranavirus']
/organism=Frog virus 3 (isolate Goorha) (FV-3)
/organism_host=['Ambystoma', 'mole salamanders', 'Dryophytes versicolor', 'chameleon treefrog', 'Lithobates pipiens', 'Northern leopard frog', 'Rana pipiens', 'Notophthalmus viridescens', 'Eastern newt', 'Triturus viridescens', 'Rana sylvatica', 'Wood frog']
/references=[Reference(title='Comparative genomic analyses of frog virus 3, type species of the genus Ranavirus (family Iridoviridae).', ...)]
/comment_function=['Transcription activation.']
/proteinExistence=['predicted']
/keywords=['Activator', 'Complete proteome', 'Reference proteome', 'Transcription', 'Transcription regulation']
/type=['ECO:0000305']
/key=['1']
/sequence_length=256
/sequence_mass=29735
/sequence_checksum=B4840739BF7D4121
/sequence_modified=2004-07-19
/sequence_version=1
Seq('MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVE...TPL', ProteinAlphabet())
I want to implement an efficient function to store only the required attributes from each record in the UniProt file, which is massive.
So for example in the above protein, I'd want the ID
as the main identifier, with the corresponding (since I am using this word, something tells me I should create a dictionary for this?) Name
, taxonomy
, sequence_length
. However I am unsure on if I create this new data set of objects if I can still treat it as if it is a SeqRecord
object?
I have read through the SeqRecord class documentation but it doesn't quite cover what I am trying to do so I just need a little bit of guidance with this. Sorry if it's obvious, still getting my head around some of this!
Edit: Have now got
def load_uniprot_records():
file = "/Users/john/workspace/practical-2/resources/uniprot_sprot_small.xml.gz"
seq_records = []
handle = gzip.open(file)
for record in SeqIO.parse(handle, "uniprot-xml"):
seq_record = SeqRecord(seq=record.seq, id=record.id,
name=record.name, description=record.description,
annotations=record.annotations)
seq_records.append(seq_record)
return seq_records
However don't know how to access just the taxonomy part of the annotations dictionary. Having annotations=record.annotations
simply assigns every annotation, whereas I just need one or two of them (taxonomy, sequence_length). Any ideas?