I'm trying to parse a very large (>1 GB) XML file with Python's lxml.etree.
I'm following the approach described in this post: Using Python Iterparse For Large XML Files
I want to efficiently locate an element by its id attribute and retrieve its contents.
What is the fastest, most memory-efficient way to do this, given that I can't hold the whole document in memory?
Adding my code to be clear:
import gzip
from lxml import etree

def get_data(file_path):
    doc_id = "[XXX]"
    with gzip.open(file_path, 'rb') as xml_file:
        context = etree.iterparse(xml_file, tag='MyDocument', events=('end',))
        for event, xml_doc in context:
            xml_doc_id = xml_doc.get('id')
            # If I could only catch this on the start event
            # and tell lxml not to bother with building this element
            if doc_id != xml_doc_id:
                xml_doc.clear()
                continue
            data = parse(xml_doc)  # parse() extracts what I need from the element
            xml_doc.clear()
            return data
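To make the behaviour I'm after concrete, here is a minimal, self-contained sketch of the same lookup using the stdlib xml.etree.ElementTree on a tiny inline document (find_doc, the sample XML, and the root.clear() bookkeeping are illustrative, not my actual code — in my real code the file is gzipped and I use lxml):

```python
import io
import xml.etree.ElementTree as ET

def find_doc(stream, doc_id):
    # Stream-parse; keep at most one finished record in memory at a time.
    context = ET.iterparse(stream, events=('start', 'end'))
    _, root = next(context)  # grab the root from the first 'start' event
    for event, elem in context:
        if event == 'end' and elem.tag == 'MyDocument':
            if elem.get('id') == doc_id:
                return elem.text
            root.clear()  # free records we've already rejected
    return None

xml = (b"<root>"
       b"<MyDocument id='a'>first</MyDocument>"
       b"<MyDocument id='b'>second</MyDocument>"
       b"</root>")
print(find_doc(io.BytesIO(xml), 'b'))  # prints: second
```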
Thank you