What is the fastest way to parse large XML docs in Python?

Question

I am currently running the following code based on Chapter 12.5 of the Python Cookbook:

from xml.parsers import expat

class Element(object):
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self,key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

class Xml2Obj(object):
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        element = Element(name.encode(), attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        self.nodeStack.pop()
    def CharacterData(self,data):
        if data.strip():
            data = data.encode()
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        Parser = expat.ParserCreate()
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        ParserStatus = Parser.Parse(open(filename).read(),1)
        return self.root

I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?

Your question is much too vague to glean any useful answers. Consider answering these questions: - What are you trying to do with this 1GB XML doc? - How fast do you need this parser to be? - Could you lazily iterate through the document, rather than loading everything into memory from the get go? — Matt, Nov 27 '08 at 17:15
I need to load it all into memory, index the data and then 'browse' and process it. — Jeroen Dirks, Nov 27 '08 at 21:33

Steen · Accepted Answer · 2022-12-16T16:14:34.520

79

It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur.

Note, however, Fredrik's advice on using the cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

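from xml.etree.ElementTree import iterparse  # xml.etree.cElementTree on Python 2

# source is a filename or a file object containing XML data
for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()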

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()  # Python 2 only; on Python 3 use next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()

Note that lxml's iterparse() (lxml.etree.iterparse) does not allow this.

The previous example does not work on Python 3.7, because iterators no longer have a .next() method there. Consider the following way to get the first element instead:

import xml.etree.ElementTree as ET

# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))
    
for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
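
For a version that can be run as-is, here is a minimal sketch of the same pattern wrapped in a generator. The helper name iter_records, the "record" tag and the in-memory sample document are assumptions made for illustration; source can be a filename or a binary file object.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield each completed <tag> element, then free what was parsed so far."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is always the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the consumer asks for the next record: drop the
            # children collected under the root so they can be freed.
            root.clear()


# Hypothetical sample data standing in for a large file.
sample = io.BytesIO(
    b"<root><record id='1'>a</record><record id='2'>b</record></root>"
)
ids = [rec.get("id") for rec in iter_records(sample)]
print(ids)
```

Because the clearing happens only after the caller has finished with a record, anything needed from an element should be copied out before moving on to the next one.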

edited Dec 16 '22 at 16:14

answered Nov 28 '08 at 20:03

Steen


Changing `context.next()` to `context.__next__()` fixed the second example for me (Python 3). – Eric Reed May 12 '20 at 23:30
Why do we need to call root.clear() in the 3rd example, given that root is not used anywhere else? – soulrider Jun 24 '21 at 06:37
@soulrider In order to let the interpreter know that it can garbage collect the nodes that have been iterated, since we don't need them anymore. You have to clear from the `root` node, otherwise references will be retained in the context – Steen Aug 05 '21 at 11:00
Notes (that were hard for me to understand): 1) You don't get the hierarchy this way, only a long flat sequence of the open and close events of the fields, so it is better to work out the hierarchy yourself on a small example first and then parse the file. 2) When event == "start" and when event == "end", you get exactly the same data... – Alon Samuel Mar 10 '22 at 16:36
The following works in Python 3.9: `parser = ET.iterparse(stream, events=("end",))`, then `event, root = next(parser)`, then `result.handleElement(root)`, then `for event, element in parser:` with the element processing followed by `root.clear()` in the loop body. – J. Beattie May 13 '22 at 16:06
@J.Beattie what is `result.handleElement(root)` ? – AhmadDeel Nov 23 '22 at 06:18
@AhmadDeel The fragment I posted in the comment above was a stripped down copy of working code. The line "result.handleElement(root)" should also have been stripped out from the code I posted. Please disregard this line. – J. Beattie Nov 24 '22 at 17:34
In these code fragments, what is `source`? The links have all rotted away. – MikeB Nov 29 '22 at 02:25
@MikeB: Just learned that Fredrik Lundh died a year ago, so his website is no more :( I've updated the links from webarchive – Steen Dec 16 '22 at 16:17

score 17 · Answer 2 · edited Feb 11 '22 at 11:52

Have you tried the cElementTree module?

cElementTree is included with Python 2.5 and later, as xml.etree.cElementTree. Refer to the benchmarks.

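Note that since Python 3.3, xml.etree.ElementTree uses the C implementation by default whenever it is available, so this change is not needed on Python 3.3+. The separate xml.etree.cElementTree module was deprecated at that point and removed in Python 3.9.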
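
If the same script has to run on both old and new interpreters, a guarded import is one option. This is only a minimal sketch; the sample XML string is made up for illustration.

```python
try:
    # Python 2.5 up to 3.8: the explicit C implementation.
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3.9+: the module no longer exists; ElementTree already uses
    # the C accelerator when it is available.
    import xml.etree.ElementTree as ET

root = ET.fromstring("<root><record id='1'/><record id='2'/></root>")
record_count = len(root.findall("record"))
print(record_count)
```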

edited Feb 11 '22 at 11:52

Augustin


answered Nov 27 '08 at 19:00

bhadra


score 11 · Answer 3 · edited Jan 28 '22 at 15:33

I recommend using lxml. It is a Python binding for the libxml2 library, which is really fast.

In my experience, libxml2 and expat have very similar performance. But I prefer libxml2 (and lxml for python) because it seems to be more actively developed and tested. Also libxml2 has more features.

lxml is mostly API-compatible with xml.etree.ElementTree, and there is good documentation on its website.
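
Because the two APIs overlap, simple code can be written against either one. The following is a minimal sketch that assumes lxml has been installed separately (it is a third-party package) and falls back to the standard library otherwise; the sample document is hypothetical.

```python
try:
    from lxml import etree as ET  # third-party; assumed to be installed
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same calls

# Hypothetical sample document.
doc = b"<root><record id='1'>a</record><record id='2'>b</record></root>"
root = ET.fromstring(doc)
texts = [rec.text for rec in root.iter("record")]
print(texts)
```

Only the calls shared by both libraries are used here; lxml-specific features such as getparent() or getprevious() have no equivalent in xml.etree.ElementTree.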

Aaron Digulla · Answer 4 · 2008-11-28T08:16:56.833

7

Registering callbacks slows down the parsing tremendously. [EDIT]This is because the (fast) C code has to invoke the Python interpreter, which is just not as fast as C. Basically, you're using the C code to read the file (fast) and then building the DOM in Python (slow).[/EDIT]

Try xml.etree.cElementTree instead (on Python 3.3 and later, plain xml.etree.ElementTree picks up the C implementation automatically). It builds the tree in C and can parse XML without any callbacks into Python code.

After the document has been parsed, you can filter it to get what you want.

If that's still too slow and you don't need a DOM, another option is to read the file into a string and use simple string operations to process it.
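
As a rough illustration of "parse first, filter afterwards", here is a minimal sketch. The element name record, the size attribute and the threshold are made up for the example, and the whole tree is held in memory, so this only suits documents that fit in RAM.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; for a real file, pass its path to ET.parse().
source = io.BytesIO(
    b"<root><record size='5'/><record size='50'/><record size='500'/></root>"
)
tree = ET.parse(source)  # the whole tree is built by the parser

# Filtering happens afterwards, on the finished tree.
big = [r for r in tree.getroot().iter("record") if int(r.get("size")) > 10]
sizes = [r.get("size") for r in big]
print(sizes)
```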

edited Nov 28 '08 at 08:16

answered Nov 27 '08 at 16:56

Aaron Digulla


This is very misleading advice. There is nothing about a callback-based XML parser that is intrinsically slow. Moreover, the OP is already using Python's expat bindings, which are also native C. – Matt Nov 27 '08 at 17:13
The python interpreter is always slower than natively compiled C code. And as you can clearly see in the code in the question, it's registering Python code to be called for every element! And this code does a lot of work, too! – Aaron Digulla Nov 28 '08 at 08:19
This should be upped, callbacks in python are really slow, you want to avoid that and do as much as possible in C land. – Johan Dahlin Nov 28 '08 at 15:18

score 5 · Answer 5 · answered Nov 27 '08 at 21:30

If your application is performance-sensitive and likely to encounter large files (like you said, > 1GB) then I'd strongly advise against using the code you're showing in your question for the simple reason that it loads the entire document into RAM. I would encourage you to rethink your design (if at all possible) to avoid holding the whole document tree in RAM at once. Not knowing what your application's requirements are, I can't properly suggest any specific approach, other than the generic piece of advice to try to use an "event-based" design.
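
To make "event-based" a little more concrete, here is a minimal sketch using the standard library's SAX interface. The handler class RecordCounter and the record element name are hypothetical; the point is only that the handler keeps a counter and never builds a tree.

```python
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <record> elements, stores nothing else."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        if name == "record":
            self.count += 1


handler = RecordCounter()
sample = io.BytesIO(b"<root><record/><record/><record/></root>")
xml.sax.parse(sample, handler)  # also accepts a filename
print(handler.count)
```

Whether this fits depends on the requirements: if the data really has to be indexed and browsed afterwards, the handler would have to store exactly the pieces that are needed and nothing more.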

Thanks! I'm trying to do a similar thing using ElementTree's iterparse, but the RAM usage still keeps going up. Do you know why? — Alon Samuel, Mar 10 '22 at 17:47

score 1 · Answer 6 · answered Nov 19 '15 at 22:50

expat ParseFile works well if you don't need to store the entire tree in memory, which will sooner or later blow your RAM for large files:

import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
# Open in binary mode: on Python 3, ParseFile() expects read() to return bytes.
with open('path.xml', 'rb') as f:
    parser.ParseFile(f)

It reads the file in chunks and feeds them to the parser, so RAM usage stays bounded.

Doc: https://docs.python.org/2/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
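
On its own, ParseFile() only checks that the document is well-formed; to get anything out of it you attach handlers. A minimal sketch, assuming all you want is a count of each tag (the function name on_start and the sample data are made up for illustration):

```python
import io
import xml.parsers.expat

tag_counts = {}


def on_start(name, attrs):
    # Called by expat for every opening tag; only a counter is kept.
    tag_counts[name] = tag_counts.get(name, 0) + 1


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start

# Hypothetical in-memory data; a real file should be opened with "rb".
sample = io.BytesIO(b"<root><record/><record/></root>")
parser.ParseFile(sample)
print(tag_counts)
```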

score 1 · Answer 7 · answered May 09 '19 at 22:42

I spent quite some time trying this out, and it seems the fastest and least memory-intensive approach is to use lxml and iterparse while making sure to free unneeded memory. In my example, parsing an arXiv dump:

from lxml import etree

context = etree.iterparse('path/to/file', events=('end',), tag='Record')

for event, element in context:
    record_id = element.findtext('.//{http://arxiv.org/OAI/arXiv/}id')
    created = element.findtext('.//{http://arxiv.org/OAI/arXiv/}created')

    print(record_id, created)

    # Free memory.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]

So element.clear() alone is not enough; you also need to remove the references to the previous elements that the parent still holds.

score 0 · Answer 8 · answered Jan 11 '22 at 07:49

In Python 3 you need to change the syntax.
Instead of this:

# get the root element
event, root = context.next()

Try this (as recommended in "Iterparse object has no attribute next"):

# get the root element
event, root = next(context)

And this line is unnecessary:

# turn it into an iterator
context = iter(context)
