
I'm new to XML parsing, and I've been trying to figure out a way to skip over a parent element's contents, because one of its nested elements contains a huge amount of data in its text (I cannot change how this file is generated). Here's an example of what the XML looks like:

<root>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
</root>

I've tried lxml.iterparse and xml.sax implementations to try to work this out, but no dice. These are the majority of the answers I've found in my searches:

  1. Use the tag keyword argument in iterparse (see the sketch after this list).

    This does not work because, although lxml cleans up the elements in the background, the large text in the element is still parsed into memory, so I'm getting large memory spikes.

  2. Create a flag that you set to True when the start event for that element is found, and then ignore the element while parsing.

    This does not work, as the element is still parsed into memory at the end event.

  3. Break before you reach the end event of the specific element.

    I cannot just break when I reach the element, because there are multiples of these elements that I need specific children data from.

  4. This is not possible as stream parsers still have an end event and generate the full element.

    ... ok.
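
For reference, here's roughly what the pattern from point 1 looks like, as a minimal sketch assuming lxml and the element names from the sample above. It frees processed elements as it goes, but memory still spikes because the whole huge_thing text node is materialized before the end event for its Parent is ever delivered:

from lxml import etree

# Sketch of the iterparse/tag approach (point 1). Element names are taken
# from the sample XML above; 'input.xml' stands in for the real file.
for event, parent in etree.iterparse('input.xml', events=('end',), tag='Parent'):
    for a in parent.iter('a'):
        print(a.text)
    # free what has already been handled so the in-memory tree stays small
    parent.clear()
    while parent.getprevious() is not None:
        del parent.getparent()[0]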

I'm currently trying to directly edit the stream data that the GzipFile sends to iterparse, in the hope that the parser never even knows the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.
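
In case it's useful, here's one way that pre-filtering idea could look, as a sketch under a few assumptions: the byte markers <subgroup> and </subgroup> never occur inside other text, 'data.xml.gz' is a stand-in for your gzipped file, and ElementTree's XMLPullParser is used instead of lxml just to keep the example self-contained. The filter drops everything between the markers before any parser ever sees it:

import gzip
import xml.etree.ElementTree as ET

def strip_huge(fileobj, start=b'<subgroup>', end=b'</subgroup>', chunk_size=64 * 1024):
    # Yield chunks of fileobj with everything from `start` through `end`
    # removed. A small tail is kept between reads so a marker split across
    # a chunk boundary is still detected.
    keep = max(len(start), len(end)) - 1
    buf = b''
    skipping = False
    while True:
        data = fileobj.read(chunk_size)
        if not data:
            if not skipping and buf:
                yield buf
            return
        buf += data
        while True:
            if skipping:
                i = buf.find(end)
                if i == -1:
                    buf = buf[-keep:]   # keep a tail in case `end` is split
                    break
                buf = buf[i + len(end):]
                skipping = False
            else:
                i = buf.find(start)
                if i == -1:
                    if len(buf) > keep:
                        yield buf[:-keep]
                        buf = buf[-keep:]
                    break
                if i:
                    yield buf[:i]
                buf = buf[i + len(start):]
                skipping = True

# Feed the filtered chunks to a pull parser; it never sees <huge_thing>.
parser = ET.XMLPullParser(events=('end',))
with gzip.open('data.xml.gz', 'rb') as f:
    for chunk in strip_huge(f):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag == 'a':
                print(elem.text)
parser.close()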

Sang

2 Answers


I don't think you can get a parser to selectively ignore some part of the XML it's parsing. Here are my findings using the SAX parser...

I took your sample XML, blew it up to just under 400MB, created a SAX parser, and ran it against my big.xml file two different ways.

  • For the straightforward approach, sax.parse('big.xml', MyHandler()), memory peaked at 12M.
  • For a buffered file reader approach, using 4K chunks, parser.feed(chunk), memory peaked at 10M.

I then doubled the size, for an 800M file, re-ran both ways, and the peak memory usage didn't change, ~10M. The SAX parser seems very efficient.

I ran this script against your sample XML to create some really big text nodes, 400M each.

with open('input.xml') as f:
    data = f.read()

with open('big.xml', 'w') as f:
    f.write(data.replace('enormous string here', 'a'*400_000_000))

Here's big.xml's size in MB:

du -ms big.xml 
763     big.xml

Here's my SAX ContentHandler, which only handles the character data if the path to the data's parent ends in thing_*/a (which, according to your sample, disqualifies huge_thing)...

BTW, much appreciation to l4mpi for this answer, showing how to buffer the character data you do want:

from xml import sax

class MyHandler(sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

        self._path = []  # stack of currently open element names

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip()  # remove strip() if whitespace is important

    def characters(self, data):
        if len(self._path) < 2:
            return

        if self._path[-1] == 'a' and self._path[-2].startswith('thing_'):
            self._charBuffer.append(data)

    def startElement(self, name, attrs):
        self._path.append(name)

    def endElement(self, name):
        self._path.pop()

        if len(self._path) == 0:
            return

        if self._path[-1].startswith('thing_'):
            print(self._path[-1])
            print(self._getCharacterData())

For both the whole-file parse method, and the chunked reader, I get:

thing_1
I need this
thing_2
I need this
thing_3

thing_1
I need this
thing_2
I need this
thing_3

It's printing thing_3 because of my simple logic, but the data in subgroup/huge_thing is ignored.

Here's how I call the handler with the straightforward parse() method:

handler = MyHandler()
sax.parse('big.xml', handler)

When I run that with Unix/BSD time, I get:

/usr/bin/time -l ./main.py
...
        1.45 real         0.64 user         0.11 sys
...
            11027456  peak memory footprint

Here's how I call the handler with the more complex chunked reader, using a 4K chunk size:

handler = MyHandler()
parser = sax.make_parser()
parser.setContentHandler(handler)

Chunk_Sz = 4096
with open('big.xml') as f:
    chunk = f.read(Chunk_Sz)
    while chunk != '':
        parser.feed(chunk)
        chunk = f.read(Chunk_Sz)
parser.close()  # tell the incremental parser the document is complete
Timing that the same way:

/usr/bin/time -l ./main.py
...
        1.85 real         1.65 user         0.19 sys
...
            10453952  peak memory footprint

Even with a 512B chunk size, it doesn't get below 10M, but the runtime doubled.

I'm curious to see what kind of performance you're getting.

Zach Young

You cannot use a DOM parser, as that would by definition load the whole document into RAM. But basically a DOM parser is just a SAX parser that creates a DOM as it goes through the SAX events.

When creating your custom SAX parser, you can not only build the DOM (or whichever other in-memory representation you prefer) but also start ignoring events when they relate to some specific location in the document.

Be aware that parsing needs to continue so you know when to stop ignoring events, but the output of the parser will not contain this unneeded large chunk of data.
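
A minimal sketch of that idea, assuming Python's xml.sax with ElementTree as the in-memory representation and the element names from the question; the parser still receives the huge character data (in chunks), it just never stores it:

import xml.etree.ElementTree as ET
from xml import sax

class FilteringTreeBuilder(sax.handler.ContentHandler):
    # Builds an ElementTree from SAX events but drops any subtree whose
    # root tag is listed in `ignore` (here: subgroup, which holds huge_thing).
    def __init__(self, ignore=('subgroup',)):
        super().__init__()
        self._ignore = set(ignore)
        self._skip_depth = 0   # > 0 while inside an ignored subtree
        self._stack = []       # open elements of the tree being built
        self.root = None

    def startElement(self, name, attrs):
        if self._skip_depth or name in self._ignore:
            self._skip_depth += 1   # track nesting so we know when the ignored subtree ends
            return
        elem = ET.Element(name, dict(attrs))
        if self._stack:
            self._stack[-1].append(elem)
        else:
            self.root = elem
        self._stack.append(elem)

    def characters(self, data):
        if self._skip_depth or not self._stack:
            return                  # text inside the ignored subtree is never stored
        cur = self._stack[-1]
        cur.text = (cur.text or '') + data

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        self._stack.pop()

handler = FilteringTreeBuilder()
sax.parse('big.xml', handler)   # 'big.xml' as generated in the answer above
print(ET.tostring(handler.root, encoding='unicode'))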

Queeg