I have an application that deals with ~1-2 megabyte XML files. Doesn't sound like much, but I've run into a performance problem nonetheless.
Since I've some compute bound tasks that I'd like to speed up I've tried using multiprocessing.imap to do that - which requires pickling this XML data. Pickling the datastructures containing references into this DOM turns out to be slower than those compute bound processes, and the culprit seems to be recursions - I had to set the recursion limit to 10'000 in order to get pickle to work in the first place :-S.
Anyways, my question is:
If I wanted to attack this problem from the referential performance angle, what should I replace minidom with? Criterias are both pickling performance but also ease of transition.
To give you an idea of what kind of methods are needed, I have pasted a wrapper class (written sometimes earlier in order to speed up getElementsByTagName
calls). It would be acceptable to replace all minidom nodes with nodes that adhere to the same interface as this class, i.e. I don't need all the methods from minidom. Getting rid of the parentNode
method would also be acceptable (and probably a good idea in order to improve pickling performance).
And yes, if I'd be designing this nowadays I wouldn't go for XML node references in the first place, but it would be a lot of work to rip all of this out now, so I hope this can be patched instead.
Should I just write the damn thing myself using python built-ins or the collections library?
class ImmutableDOMNode(object):
def __init__(self, node):
self.node = node
self.cachedElementsByTagName = {}
@property
def nodeType(self):
return self.node.nodeType
@property
def tagName(self):
return self.node.tagName
@property
def ownerDocument(self):
return self.node.ownerDocument
@property
def nodeName(self):
return self.node.nodeName
@property
def nodeValue(self):
return self.node.nodeValue
@property
def attributes(self):
return self.node.attributes
@property
def parentNode(self):
return ImmutableDOMNode(self.node.parentNode)
@property
def firstChild(self):
return ImmutableDOMNode(self.node.firstChild)
@property
def childNodes(self):
return [ImmutableDOMNode(node) for node in self.node.childNodes]
def getElementsByTagName(self, name):
result = self.cachedElementsByTagName.get(name)
if result != None:
return result
uncachedResult = self.node.getElementsByTagName(name)
cachedResult = [ImmutableDOMNode(node) for node in uncachedResult]
self.cachedElementsByTagName[name] = cachedResult
return cachedResult
def getAttribute(self, qName):
return self.node.getAttribute(qName)
def toxml(self, encoding=None):
return self.node.toxml(encoding)
def toprettyxml(self, indent="", newl="", encoding=None):
return self.node.toprettyxml(indent, newl, encoding)
def appendChild(self, node):
raise Exception("cannot append child to immutable node")
def removeChild(self, node):
raise Exception("cannot remove child from immutable node")
def cloneNode(self, deep):
raise Exception("clone node not implemented")
def createElement(self, tagName):
raise Exception("cannot create element for immutable node")
def createTextNode(self, tagName):
raise Exception("cannot create text node for immutable node")
def createAttribute(self, qName):
raise Exception("cannot create attribute for immutable node")