How to efficiently read large XML file and create custom objects? (Biopython SeqIO)

Question

I am writing a program that takes a large XML file full of protein sequences from UniProt and analyses them. I have a working version already however that is for a smaller file. I now need to load in the full UniProt database efficiently and am unsure on how to go about it. I also want to extract specific attributes from the SeqRecord objects that are created so there is only the data that I am interested in. I have the following:

import gzip
from Bio import SeqIO

file = "/Users/john/workspace/project-2/resources/uniprot_sprot_small.xml.gz"


def load_uniprot_records():
    records = []

    handle = gzip.open(file)
    for record in SeqIO.parse(handle, "uniprot-xml"):
        records.append(record)
        print(record)


if __name__ == "__main__":
    load_uniprot_records()

This presents some records that look like this:

ID: Q6GZX4
Name: 001R_FRG3G
Description: Putative transcription factor 001R
Database cross-references: DOI:10.1016/j.virol.2004.02.019, EMBL:AY548484, GO:GO:0046782, GeneID:2947773, InterPro:IPR007031, KEGG:vg:2947773, NCBI Taxonomy:654924, Pfam:PF04947, Proteomes:UP000008770, PubMed:15165820, RefSeq:YP_031579.1, Swiss-Prot:001R_FRG3G, Swiss-Prot:Q6GZX4, SwissPalm:Q6GZX4
Number of features: 2
/dataset=Swiss-Prot
/created=2011-06-28
/modified=2019-06-05
/version=37
/accessions=['Q6GZX4']
/recommendedName_fullName=['Putative transcription factor 001R']
/gene_name_ORF=['FV3-001R']
/taxonomy=['Viruses', 'Iridoviridae', 'Alphairidovirinae', 'Ranavirus']
/organism=Frog virus 3 (isolate Goorha) (FV-3)
/organism_host=['Ambystoma', 'mole salamanders', 'Dryophytes versicolor', 'chameleon treefrog', 'Lithobates pipiens', 'Northern leopard frog', 'Rana pipiens', 'Notophthalmus viridescens', 'Eastern newt', 'Triturus viridescens', 'Rana sylvatica', 'Wood frog']
/references=[Reference(title='Comparative genomic analyses of frog virus 3, type species of the genus Ranavirus (family Iridoviridae).', ...)]
/comment_function=['Transcription activation.']
/proteinExistence=['predicted']
/keywords=['Activator', 'Complete proteome', 'Reference proteome', 'Transcription', 'Transcription regulation']
/type=['ECO:0000305']
/key=['1']
/sequence_length=256
/sequence_mass=29735
/sequence_checksum=B4840739BF7D4121
/sequence_modified=2004-07-19
/sequence_version=1
Seq('MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVE...TPL', ProteinAlphabet())

I want to implement an efficient function to store only the required attributes from each record in the UniProt file, which is massive.

So for example in the above protein, I'd want the ID as the main identifier, with the corresponding (since I am using this word, something tells me I should create a dictionary for this?) Name, taxonomy, sequence_length. However I am unsure on if I create this new data set of objects if I can still treat it as if it is a SeqRecord object?

I have read through the SeqRecord class documentation but it doesn't quite cover what I am trying to do so I just need a little bit of guidance with this. Sorry if it's obvious, still getting my head around some of this!

Edit: Have now got

def load_uniprot_records():
    file = "/Users/john/workspace/practical-2/resources/uniprot_sprot_small.xml.gz"
    seq_records = []

    handle = gzip.open(file)
    for record in SeqIO.parse(handle, "uniprot-xml"):
        seq_record = SeqRecord(seq=record.seq, id=record.id,
                               name=record.name, description=record.description,
                               annotations=record.annotations)
        seq_records.append(seq_record)

    return seq_records

However don't know how to access just the taxonomy part of the annotations dictionary. Having annotations=record.annotations simply assigns every annotation, whereas I just need one or two of them (taxonomy, sequence_length). Any ideas?

I'd convert `xml` to `json`. There's a [lib](https://github.com/martinblech/xmltodict) for that. Parsing `xml` is pain. — Piotr Rarus, Dec 09 '19 at 08:22

sammy · Answer 1 · 2019-12-09T08:13:24.060

1

Yeah, SeqRecords are very complete sequence representations, but not very efficient. The most efficient data structure for your purpose is probably a simple dictionary, where the ID is key, and the value is a tuple with the fixed positions (name, taxonomy, seq_length) (you could also use a namedtuple for that, but their performance is slightly worse, I believe). Then you don't even need to store the SeqRecord, you can just extract the relevant information and then discard it:

seq_records = {}  # keys remain ordered since Python 3.7 :)
for record in SeqIO.parse(handle, "uniprot-xml"):
    name = record.name
    description = record.description
    seq_length = len(record)
    seq_records[record.id] = (name, description, seq_length)

(I don't remember the exact attribute names, you can get them using dir(record) or here, but you get the idea.) This will enable a very efficient data handling later. Do you also need an efficient parsing of the file? There might be other, faster, ways than using SeqIO, like ElementTree (How do I parse XML in Python?), but unless it's a time-critical step, I wouldn't bother.

edited Dec 09 '19 at 08:13

answered Dec 05 '19 at 21:20

sammy

857
5
13

I was thinking that. How would I go about just extracting those attributes and populating the dictionary? I think I need just the one loop to go over all the records in the UniProt db. However I need to minimise the amount of data because of memory. Edit: Yeah so I need to iterate record by record and just pull out the stuff I need and put it in the dictionary. I have to use SeqIO for this. – Johnboy Dec 05 '19 at 21:24
Can you check my updated post and let me know if you know how to access the annotations dictionary? I found a way to create my own SeqRecord and I will just not include the stuff I don't need. Still not too efficient but it's better than having all the stuff I'm not interested in filtered out. – Johnboy Dec 05 '19 at 22:18
This would work, even though a dictionary with tuples is probably (substantially) faster, just look at this benchmark: https://stackoverflow.com/questions/49419132/namedtuple-slow-compared-to-tuple-dictionary-class. What do you mean by accessing the annotations dictionary? I don't think I understand your question. – sammy Dec 05 '19 at 23:46

How to efficiently read large XML file and create custom objects? (Biopython SeqIO)

1 Answers1