I'm new to XML parsing, and I've been trying to figure out a way to skip over part of a parent element's contents, because one nested element contains a huge amount of data in its text (I cannot change how this file is generated). Here's an example of what the XML looks like:
<root>
  <Parent>
    <thing_1>
      <a>I need this</a>
    </thing_1>
    <thing_2>
      <a>I need this</a>
    </thing_2>
    <thing_3>
      <subgroup>
        <huge_thing>enormous string here</huge_thing>
      </subgroup>
    </thing_3>
  </Parent>
  <Parent>
    <thing_1>
      <a>I need this</a>
    </thing_1>
    <thing_2>
      <a>I need this</a>
    </thing_2>
    <thing_3>
      <subgroup>
        <huge_thing>enormous string here</huge_thing>
      </subgroup>
    </thing_3>
  </Parent>
</root>
I've tried both lxml.iterparse and xml.sax implementations, but no dice. These are the answers I keep finding in my searches:
"Use the tag keyword in iterparse."
This does not work: although lxml cleans up the other elements in the background, the large text is still parsed into memory first, so I still get large memory spikes (a stripped-down version of what I'm doing is shown after this list).

"Create a flag, set it to True when the start event for that element is found, and then ignore the element while parsing."
This does not work either, because the element has already been parsed into memory by the time the end event fires.

"Break before you reach the end event of the specific element."
I cannot just break when I reach that element, because there are multiple Parent elements and I need specific children's data from each of them.

"This is not possible, since stream parsers still fire an end event and build the full element."
... ok.
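For reference, this is roughly the iterparse pattern from that first suggestion, stripped down to the example above (the read_parents name and the file path are just placeholders I made up):

import gzip
from lxml import etree

def read_parents(path):
    """Yield the <a> text values from each <Parent>, clearing elements as we go."""
    with gzip.open(path, "rb") as f:
        # tag="Parent" limits which elements get yielded, but lxml still
        # builds every child (including <huge_thing>) in memory first.
        for _, parent in etree.iterparse(f, tag="Parent"):
            yield [a.text for a in parent.iter("a")]
            # The usual cleanup so already-processed elements can be freed.
            parent.clear()
            while parent.getprevious() is not None:
                del parent.getparent()[0]

Even with that cleanup, memory spikes while each <huge_thing> is being built.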
I'm currently trying to edit the raw stream data that the GzipFile sends to iterparse, in the hope that the parser never even knows the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.
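To show what I mean by editing the stream, here is a rough sketch of the direction I'm attempting: a small file-like wrapper that drops everything from a literal <huge_thing> marker through the matching </huge_thing> before iterparse ever sees it. The SkippingReader name, the chunk size, and the assumption that the tag always appears exactly as those literal bytes (no attributes, no encoding surprises) are all my own simplifications, not anything from a library:

import gzip
from lxml import etree

class SkippingReader:
    """File-like wrapper that removes everything from start_marker through
    end_marker (inclusive), so the parser never receives the huge text."""

    def __init__(self, fileobj, start_marker, end_marker, chunk_size=64 * 1024):
        self.f = fileobj
        self.start = start_marker
        self.end = end_marker
        self.chunk_size = chunk_size
        self.skipping = False  # currently inside the element being dropped?
        self.carry = b""       # tail kept in case a marker spans two chunks
        self.buffer = b""      # filtered bytes not yet handed to the parser

    def read(self, size=-1):
        while size < 0 or len(self.buffer) < size:
            chunk = self.f.read(self.chunk_size)
            if not chunk:
                if not self.skipping:
                    self.buffer += self.carry
                self.carry = b""
                break
            self._filter(self.carry + chunk)
        if size < 0:
            out, self.buffer = self.buffer, b""
        else:
            out, self.buffer = self.buffer[:size], self.buffer[size:]
        return out

    def _filter(self, data):
        pos = 0
        while True:
            marker = self.end if self.skipping else self.start
            i = data.find(marker, pos)
            if i == -1:
                # Marker not found: keep a short tail in case it is split
                # across this chunk and the next one.
                tail = max(pos, len(data) - len(marker) + 1)
                if not self.skipping:
                    self.buffer += data[pos:tail]
                self.carry = data[tail:]
                return
            if not self.skipping:
                self.buffer += data[pos:i]  # emit everything before the marker
            pos = i + len(marker)           # drop the marker itself
            self.skipping = not self.skipping

with gzip.open("data.xml.gz", "rb") as gz:
    wrapped = SkippingReader(gz, b"<huge_thing>", b"</huge_thing>")
    for _, parent in etree.iterparse(wrapped, tag="Parent"):
        print([a.text for a in parent.iter("a")])
        parent.clear()

The idea is that the dropped span leaves an effectively empty <subgroup></subgroup> behind, so the document stays well formed and memory stays bounded by the chunk size, but I'm not confident this is the right approach.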