You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.
The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don't even need to be given all the XML at once, which would be ideal when you have 25 GB of it.
Unfortunately, if you need any context information, like "I want tag <B> but only if it's inside tag <A>," you must maintain it yourself, since all the parser gives you is "start tag <A>, start tag <B>, end tag <B>, end tag <A>." It never explicitly tells you that tag <B> is inside tag <A>; you have to figure that out from what you saw. And once you have seen an element, it's gone unless you remembered it yourself.
This gets very hairy for complex parsing jobs, but yours is probably manageable.
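As a rough illustration of keeping that context yourself, here is a minimal sketch using Python's xml.sax. The handler class name, the tag names A and B, and the sample XML are made up for the example; it assumes you want the text of each <B> whose direct parent is <A>.

```python
import xml.sax


class BInsideAHandler(xml.sax.ContentHandler):
    """Collects the text of every <B> whose direct parent is <A>."""

    def __init__(self):
        super().__init__()
        self.stack = []            # names of the currently open elements
        self.buffer = None         # text pieces of the <B> being captured, else None
        self.capture_depth = None  # stack depth of the <B> being captured
        self.results = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        is_b_in_a = (
            name == "B" and len(self.stack) >= 2 and self.stack[-2] == "A"
        )
        if is_b_in_a and self.buffer is None:
            self.buffer = []
            self.capture_depth = len(self.stack)

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        # Only the end tag at the depth where capture began closes the capture.
        if self.buffer is not None and len(self.stack) == self.capture_depth:
            self.results.append("".join(self.buffer))
            self.buffer = None
            self.capture_depth = None
        self.stack.pop()


handler = BInsideAHandler()
xml.sax.parseString(
    b"<root><A><B>keep</B></A><C><B>skip</B></C></root>", handler
)
print(handler.results)  # ['keep']
```

The stack is the only "memory" here: the parser reports start and end events, and the handler works out nesting from the order in which they arrive.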
It happens that Python's standard library has a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser.
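A minimal sketch of feeding the parser piece by piece might look like the following. It relies on the default parser returned by xml.sax.make_parser() (expat-based in CPython) implementing the IncrementalParser interface, i.e. feed() and close(). The TagCounter handler, the parse_in_chunks helper, and the sample data are hypothetical names for illustration only.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how many times each element name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def parse_in_chunks(stream, handler, chunk_size=64 * 1024):
    """Feed a binary stream to the SAX parser one chunk at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # chunks may end in the middle of a tag
    parser.close()          # signals end of input; raises if the XML is incomplete


counter = TagCounter()
data = io.BytesIO(b"<root><A><B>x</B></A><B>y</B></root>")
# A tiny chunk size just to show that chunks need not align with tags.
parse_in_chunks(data, counter, chunk_size=7)
print(counter.counts)  # {'root': 1, 'A': 1, 'B': 2}
```

For a file on disk, passing a file opened in binary mode as the stream should work the same way, so only one chunk needs to be in memory at a time. If you don't need control over the chunking, xml.sax.parse(filename, handler) is the simpler entry point and, as far as I know, also reads the input in pieces rather than all at once.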