I'm trying to parse a very large (>1 GB) XML file with Python's lxml.etree.
I'm following the approach described in this post: Using Python Iterparse For Large XML Files
I want to efficiently locate an element by its id attribute and retrieve its contents.
What is the fastest, most memory-efficient way to do this, given that I can't hold the whole document in memory?
Adding my code to be clear:
import gzip
from lxml import etree

def get_data(file_path):
    doc_id = "[XXX]"
    with gzip.open(file_path, 'rb') as xml_file:
        context = etree.iterparse(xml_file, tag='MyDocument', events=('end',))
        for event, xml_doc in context:
            xml_doc_id = xml_doc.get('id')
            # If I could only catch this on the start event
            # and tell lxml not to bother with building this element
            if doc_id != xml_doc_id:
                xml_doc.clear()
                continue
            data = parse(xml_doc)  # parse() extracts what I need from the element
            xml_doc.clear()
            return data
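To make the behaviour I'm after concrete, here is a minimal, self-contained sketch of the same lookup using the stdlib xml.etree.ElementTree on a tiny inline document (find_doc, the sample XML, and the root.clear() bookkeeping are illustrative, not my actual code — in my real code the file is gzipped and I use lxml):

```python
import io
import xml.etree.ElementTree as ET

def find_doc(stream, doc_id):
    # Stream-parse; keep at most one finished record in memory at a time.
    context = ET.iterparse(stream, events=('start', 'end'))
    _, root = next(context)  # grab the root from the first 'start' event
    for event, elem in context:
        if event == 'end' and elem.tag == 'MyDocument':
            if elem.get('id') == doc_id:
                return elem.text
            root.clear()  # free records we've already rejected
    return None

xml = (b"<root>"
       b"<MyDocument id='a'>first</MyDocument>"
       b"<MyDocument id='b'>second</MyDocument>"
       b"</root>")
print(find_doc(io.BytesIO(xml), 'b'))  # prints: second
```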
Thank you