
I have an XML document (1.5MB) that needs to be parsed in real-time for a web service that I am developing. I am using the cElementTree Python library which, according to this post, is the preferred way to parse XML in Python, but I'm not sure if this is actually the fastest way.

I would like to improve parsing performance and minimise memory usage on my server, and I am currently testing the SAX approach with ET.iterparse(). My benchmark shows the following results for parsing the same XML document 200 times (parsing only, with no further processing of the data).

  • DOM with ET.XML(): 20.5s
  • SAX with ET.iterparse(): 32.4s

That works out to roughly 102ms per XML document for the DOM approach versus 162ms for the SAX approach.
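
For reference, a minimal sketch of the kind of harness behind these numbers (sample.xml is a placeholder for the 1.5MB document; the SAX side mirrors the iterparse pattern shown further below):

from cStringIO import StringIO
import timeit
import xml.etree.cElementTree as ET

xml_string = open("sample.xml", "rb").read()

def parse_dom():
    # builds the whole element tree in memory in one call
    ET.XML(xml_string)

def parse_sax():
    # streams the document, clearing each element once it is complete
    for event, elem in ET.iterparse(StringIO(xml_string), events=("start", "end")):
        if event == "end":
            elem.clear()

print "DOM with ET.XML():       %.1fs" % timeit.timeit(parse_dom, number=200)
print "SAX with ET.iterparse(): %.1fs" % timeit.timeit(parse_sax, number=200)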

However, I would still like to squeeze more performance out of the SAX approach to match the 102ms of DOM or possibly go even faster, as performance and memory are both critical in my application.

I am using one of the common iterparse patterns for this kind of SAX-style parsing, as in the code below:

from cStringIO import StringIO
import xml.etree.cElementTree as ET

def parse(xml_string):
    result = []
    io = StringIO(xml_string)
    context = ET.iterparse(io, events=("start", "end"))
    for event, elem in context:
        if event == 'end':
            # the element is only complete on the "end" event; on "start"
            # its text and children may not have been parsed yet
            tag = elem.tag
            value = elem.text

            # get value from element and add to result[]
            result.append((tag, value))

            # free the element once it has been processed
            elem.clear()

    return result
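
A leaner variant of the same pattern might look like the sketch below (not yet benchmarked; the tag name "item" is a placeholder, and it assumes the value I need is in the element's own text). It subscribes only to "end" events, so the Python loop sees half as many events and every element is already complete when it arrives, and it clears each element as soon as it has been handled.

from cStringIO import StringIO
import xml.etree.cElementTree as ET

def parse_lean(xml_string, wanted_tag="item"):
    result = []
    io = StringIO(xml_string)
    # "end" events only: half as many events for the loop to handle,
    # and text/children are guaranteed to be complete when they arrive
    context = ET.iterparse(io, events=("end",))
    for event, elem in context:
        if elem.tag == wanted_tag:
            result.append(elem.text)
        # clear every element so its text and children are freed as we go
        elem.clear()
    return result

The cleared elements still leave empty shells attached to the root; if that matters, the usual further step is to grab the root from the first "start" event and call root.clear() periodically, at the cost of subscribing to "start" events again.
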
  • I would like to parse XMLs of similar size constantly in my application. The application constantly queries external web services (there can be 10-15 at once, and this number is growing) and returns XML documents. As the application may be concurrently used by many different users, the number of XML documents that reside in memory may increase as I add more external web services and users to the application. The idea is that I would like to save memory in the long run. – arimbun Jun 14 '13 at 04:49
  • Yes unfortunately, using `fast_iter` with `lxml` in my case actually slows it down to about 39s... I will try some other approach to see if I can get any better than what I have. – arimbun Jun 14 '13 at 06:22
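
For reference, the fast_iter recipe mentioned in the comment above is usually written roughly as below. It depends on lxml-only methods (getparent() and getprevious()), so it cannot be dropped into the cElementTree code as-is, and as noted it was not faster for this particular workload. The tag name "item", the file name sample.xml, and the handle() callback are placeholders.

from lxml import etree

def fast_iter(context, func):
    # process each element, then free it and any already-processed
    # siblings so memory stays flat while streaming
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def handle(elem):
    pass  # extract whatever values are needed from elem

context = etree.iterparse("sample.xml", events=("end",), tag="item")
fast_iter(context, handle)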

0 Answers