1

I need to parse very large XML files (in the range of 3-5GB), which must split into several smaller XML files according to data included in XML nodes.

Each input file includes several hundred thousand <measure> elements, like in this (very) simplified fragment.

    <items>
        <measure code="0810">
            <condition sequ="001" SID="-5041162"/>
            <footnote Id="00550"/>
            <footnote Id="00735"/>
        </measure>
        <measure code="6304">
            <component Id="01" national="1"/>
            <footnote Id="00001"/>
        </measure>
        <measure code="0811">
            <condition sequ="002" SID="-5041356"/>
            <footnote Id="00555"/>
        </measure>
        <measure code="2915">
            <component Id="01" national="0"/>
            <certif SID="-737740"/>
            <certif SID="-737780"/>
        </measure>
    </items>

The content of the actual <measure> elements can be almost any well-formed XML.

I need to do two processes while parsing these files:

  1. Extract information from the content of <measure> elements, and dump it to a MongoDB database (this part is solved...)
  2. Partition the original XML file into, say 100, XML subfiles based on the first two digits of the "code" attribute of each <measure> node. That is, new 100 XML files (named 'part_00.xml' to 'part_99.xml') need to be created and each <measure> element must be appended to the corresponding subfile. I.e. <measure> blocks 1 and 3 in the sample should be copied to 'part_08.xml', block 2 should be copied to 'part_63.xml'...

I'm using SAX to parse the original files, and process 1 above runs nicely. The pure skeleton of the SAX process is:

    import sys
    from xml.sax import ContentHandler
    from xml.sax import make_parser

    class ParseMeasures(ContentHandler):
        code = ''

        def startElement(self, name, attrs):
            if name == 'measure':
                self.code = attrs.get('code')

        def endElement(self, name):
            if name == 'measure':
                print('***Must append <measure> block to file part_{0}.xml'.format(self.code[:2]))

    def main(args):
        handler = ParseMeasures()
        sax_parser = make_parser()
        sax_parser.setContentHandler(handler)
        sax_parser.parse('my_large_xml.file.xml')
        print('Ended')

    if __name__ == '__main__':
        main(sys.argv[1:])

What I would need is to be able to access the whole <measure> XML element, in 'endElement()', to append it to the corresponding subfile.

Is there a way to combine SAX with other XML parsing functionality, that will allow to obtain the whole <measure> XML element in 'endElement()'? (I can handle the creation and management of the subfiles... This is not the problem!)

Or maybe the SAX approach is not the most adequate in this situation, to start with?

The "only" caveat is that the process should handle input files in the range of 3-5GB...

JPeraita
  • 113
  • 1
  • 7
  • your XML has a very simple structure. why not just read file row by row (or by bytes chunks) and use regex every time when you caught close "measure" tag? – T'East Oct 28 '19 at 12:15
  • 1
    Instead of SAX you probably want to use the lesser known XML Pull Parser in xml.dom.pulldom: "[The xml.dom.pulldom module provides a “pull parser” which can also be asked to produce DOM-accessible fragments of the document where necessary.](https://docs.python.org/3/library/xml.dom.pulldom.html)" This is also known as StAX parsing. – kthy Oct 28 '19 at 12:26
  • @T'East The simplified example has a very simple structure indeed! Unfortunately, the actual structure of the real "measure" blocks can be very complex, including a large number of optional attributes, in non-fixed order, line breaks in its content, etc. – JPeraita Oct 28 '19 at 12:57

2 Answers2

2

The following is a hybrid solution that uses the built-in SAX parser to generate parsing events, and lxml to build partial trees (<measure> elements only, and only one at at time).

Once the element has been built, it is serialized by lxml's API for incremental XML generation into different files, depending on the value of @code.

This code handles any level of nesting inside the <measure> elements, and text values including whitespace. It does not currently handle comments, processing instructions, or namespaces, but support for those could be added.

Memory consumption should remain low even with large input files. lxml will add some overhead, but the iterative writing support is quite convenient. Doing this completely manually would be faster overall, but also be more involved.

from xml.sax import ContentHandler, make_parser
from lxml import etree

class ParseMeasures(ContentHandler):
    def __init__(self):
        self.stack = []
        self.open = False
        self.elem = None
        self.writers = {}
        self.text = []

    def _get_writer(self, filename):
        with etree.xmlfile(filename) as xf:
            with xf.element('items'):
                while True:
                    el = (yield)
                    xf.write(el)
                    xf.flush()  # maybe don't flush *every* write

    def _write(self):
        grp = self.elem.attrib['code'][0:2]

        if grp in self.writers:
            writer = self.writers[grp]
        else:
            writer = self.writers[grp] = self._get_writer('part_%s.xml' % grp)
            next(writer)        # run up to `yield` and wait

        writer.send(self.elem)  # write out current `<measure>`
        self.elem = None

    def _add_text(self):
        if self.elem is not None and self.text:
            if self.open:
                self.elem.text = ''.join(self.text)
            else:
                self.elem.tail = ''.join(self.text)
            self.text = []

    def startElement(self, name, attrib):
        if self.stack or name == 'measure':
            self._add_text()
            self.open = True
            self.elem = etree.Element(name, attrib)
            self.stack.append(self.elem)
            if len(self.stack) > 1:
                self.stack[-2].append(self.elem)

    def characters(self, content):
        if self.elem is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.stack:
            self._add_text()
            self.open = False
            self.elem = self.stack.pop()            
            if not self.stack:
                self._write()

    def endDocument(self):
        # clean up
        for writer in self.writers:
            self.writers[writer].close()


def main():
    sax_parser = make_parser()
    sax_parser.setContentHandler(ParseMeasures())
    sax_parser.parse(r'test.xml')

if __name__ == '__main__':
    main()

This generates part_08.xml

<items>
    <measure code="0810">
        <condition sequ="001" SID="-5041162"/>
        <footnote Id="00550"/>
        <footnote Id="00735"/>
    </measure>
    <measure code="0811">
        <condition sequ="002" SID="-5041356"/>
        <footnote Id="00555"/>
    </measure>
</items>

and part_29.xml

<items>
    <measure code="2915">
        <component Id="01" national="0"/>
        <certif SID="-737740"/>
        <certif SID="-737780"/>
    </measure>
</items>

and part_63.xml

<items>
    <measure code="6304">
        <component Id="01" national="1"/>
        <footnote Id="00001"/>
    </measure>
</items>
Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • An explanation of the `el = (yield)` syntax is over here: https://stackoverflow.com/a/32128704/ – Tomalak Oct 28 '19 at 14:23
  • Thanks. However, it seems that etree.iterparse is trying to ingest the whole file. I tried it on a medium size ('only' 2GB) file and memory consumption skyrocketed, resulting in a system crash before reaching the 'for'... – JPeraita Oct 28 '19 at 14:29
  • Hm, you're right, `.iterparse()` of course still builds the whole document tree as it goes along, I had not considered that. Give me a moment. – Tomalak Oct 28 '19 at 15:07
  • @JPeraita I've updated the answer to address this. Tell me how it works for you. – Tomalak Oct 28 '19 at 16:45
  • Haven´t tried it yet... But `self.elem = etree.Element(name, attrib)` seems to point to what I was searching! – JPeraita Oct 28 '19 at 17:29
  • Wow, one whole line out of everything I wrote? :D – Tomalak Oct 28 '19 at 17:31
  • In any case, give it a try. It should be quite a bit of the way, it might even already be good enough. – Tomalak Oct 28 '19 at 17:37
  • Don't take me wrong! Building the element's content into a stack is a _very_ clever idea, but using etree.Element() to add content is just what does the trick! The 'get_writer()' function is also a very useful solution to handle all the detail files. I've tested your code and it works nicely... I now only need to adapt it to the real environment: using SAX to parse inner content of "measure" and dealing with namespaces... – JPeraita Oct 29 '19 at 09:04
  • The inner content of `` is already handled by this code to arbitrary levels of nesting. Default namespaces are handled implicitly. Prefixed namespaces must be mapped manually, because they come out of SAX as `prefix:name`, which is not legal in `lxml`, which expects `{namespace-uri}name` and a `nsmap` dict, where prefixes are defined. Not too difficult to do, but it must be applied to both element names and attribute names. – Tomalak Oct 29 '19 at 10:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/201541/discussion-between-jperaita-and-tomalak). – JPeraita Oct 29 '19 at 11:19
1

Below (the output is 3 files having the names 'part_zz.xml')

Note that the solution below does not use any external library.

import sys
import xml.etree.ElementTree as ET
from collections import defaultdict
from xml.sax import ContentHandler
from xml.sax import make_parser


class ParseMeasures(ContentHandler):
    def __init__(self):
        self.data = defaultdict(list)
        self.code = None

    def startElement(self, name, attrs):
        if name == 'measure':
            self.code = attrs.get('code')[:2]
        if self.code:
            self.data[self.code].append((name, attrs._attrs))

    def endDocument(self):
        for k, v in self.data.items():
            root = None
            for entry in v:
                if entry[0] == 'measure':
                    if not root:
                        root = ET.Element('items')
                    measure = ET.SubElement(root, 'measure')
                temp = ET.SubElement(measure, entry[0])
                temp.attrib = entry[1]
            tree = ET.ElementTree(root)
            tree.write('part_{}.xml'.format(k), method='xml')


def main(args):
    handler = ParseMeasures()
    sax_parser = make_parser()
    sax_parser.setContentHandler(handler)
    sax_parser.parse('my_large_xml.file.xml')


if __name__ == '__main__':
    main(sys.argv[1:])

Output example ('part_08.xml')

<items>
    <measure>
        <measure code="0810"/>
        <condition SID="-5041162" sequ="001"/>
        <footnote Id="00550"/>
        <footnote Id="00735"/>
    </measure>
    <measure>
        <measure code="0811"/>
        <condition SID="-5041356" sequ="002"/>
        <footnote Id="00555"/>
    </measure>
</items>
balderman
  • 22,927
  • 7
  • 34
  • 52
  • This neglects writing any text content (the OP stated their XML is more complex than the sample). – Tomalak Oct 28 '19 at 16:02
  • I agree that the code supports only the XML that was posted by the OP. If a more complex XML will be submitted by him the code will be modified in order to support it. – balderman Oct 28 '19 at 16:05
  • Also this build a whole `tree` in memory, which could be quite wasteful/slow given the OP's input XML size. – Tomalak Oct 28 '19 at 16:58
  • Indeed, I just stumbled upon a 6GB input file. I fear that trying to feed that into a dictionary might be suboptimal. OTOH, the contents of "measure" elements can be almost any well-formed XML. – JPeraita Oct 28 '19 at 17:22