I have an XML document (1.5 MB) that needs to be parsed in real time for a web service I am developing. I am using the cElementTree Python library which, according to this post, is the preferred way to parse XML in Python, but I'm not sure whether it is actually the fastest.
I would like to increase parsing performance and minimise memory usage on my server, so I am currently testing the SAX approach with ET.iterparse(). My benchmark, which does nothing but parse the same XML document 200 times, gives the following results:
- DOM with ET.XML(): 20.5s
- SAX with ET.iterparse(): 32.4s
This is equivalent to roughly 102ms per XML document for DOM versus 162ms for SAX.
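For reference, a benchmark of this kind can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact harness behind the numbers above: `make_sample_xml`, `parse_dom`, `parse_iter` and `benchmark` are hypothetical names, the `<item>` document is a synthetic stand-in for the real 1.5 MB file, and the imports are the Python 3 equivalents (`xml.etree.ElementTree`, `io.BytesIO`) of the `cElementTree` and `cStringIO` modules used in this question.

```python
import io
import timeit
import xml.etree.ElementTree as ET  # Python 3 stand-in for xml.etree.cElementTree


def make_sample_xml(n_items=1000):
    """Build a synthetic document (assumption: stands in for the real file)."""
    items = "".join(
        "<item id='%d'>value %d</item>" % (i, i) for i in range(n_items)
    )
    return ("<root>%s</root>" % items).encode("utf-8")


def parse_dom(xml_bytes):
    """Parse the whole document into a tree, then read the values."""
    root = ET.XML(xml_bytes)
    return [child.text for child in root]


def parse_iter(xml_bytes):
    """Parse incrementally with iterparse, collecting values on 'end' events."""
    result = []
    context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            result.append(elem.text)
            elem.clear()  # release the element's content once it is complete
    return result


def benchmark(func, xml_bytes, repeat=200):
    """Return (total seconds, milliseconds per document) over `repeat` runs."""
    total = timeit.timeit(lambda: func(xml_bytes), number=repeat)
    return total, total / repeat * 1000.0


if __name__ == "__main__":
    sample = make_sample_xml()
    for name, func in (("DOM", parse_dom), ("iterparse", parse_iter)):
        total, per_doc = benchmark(func, sample)
        print("%s: %.2fs total, %.1fms per document" % (name, total, per_doc))
```

Timing both functions on the same in-memory bytes keeps file I/O out of the measurement, so the comparison covers parsing only.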
However, I would still like to squeeze more performance out of the SAX approach to match the 102ms of DOM or possibly go even faster, as performance and memory are both critical in my application.
I am using one of the common patterns for SAX-style parsing, shown in the code below:
from cStringIO import StringIO
import xml.etree.cElementTree as ET

def parse(xml_string):
    result = []
    io = StringIO(xml_string)
    context = ET.iterparse(io, events=("start", "end"))
    for event, elem in context:
        tag = elem.tag
        value = elem.text
        if event == 'end':
            # get value from element and add to result[]
            pass
        elem.clear()
    return result