0

I have an application that deals with ~1-2 megabyte XML files. Doesn't sound like much, but I've run into a performance problem nonetheless.

Since I've some compute bound tasks that I'd like to speed up I've tried using multiprocessing.imap to do that - which requires pickling this XML data. Pickling the datastructures containing references into this DOM turns out to be slower than those compute bound processes, and the culprit seems to be recursions - I had to set the recursion limit to 10'000 in order to get pickle to work in the first place :-S.

Anyways, my question is:

If I wanted to attack this problem from the referential performance angle, what should I replace minidom with? Criterias are both pickling performance but also ease of transition.

To give you an idea of what kind of methods are needed, I have pasted a wrapper class (written sometimes earlier in order to speed up getElementsByTagName calls). It would be acceptable to replace all minidom nodes with nodes that adhere to the same interface as this class, i.e. I don't need all the methods from minidom. Getting rid of the parentNode method would also be acceptable (and probably a good idea in order to improve pickling performance).

And yes, if I'd be designing this nowadays I wouldn't go for XML node references in the first place, but it would be a lot of work to rip all of this out now, so I hope this can be patched instead.

Should I just write the damn thing myself using python built-ins or the collections library?

class ImmutableDOMNode(object):
    def __init__(self, node):
        self.node = node
        self.cachedElementsByTagName = {}

    @property
    def nodeType(self):
        return self.node.nodeType

    @property
    def tagName(self):
        return self.node.tagName

    @property
    def ownerDocument(self):
        return self.node.ownerDocument

    @property
    def nodeName(self):
        return self.node.nodeName

    @property
    def nodeValue(self):
        return self.node.nodeValue

    @property
    def attributes(self):
        return self.node.attributes

    @property
    def parentNode(self):
        return ImmutableDOMNode(self.node.parentNode)

    @property
    def firstChild(self):
        return ImmutableDOMNode(self.node.firstChild)

    @property
    def childNodes(self):
        return [ImmutableDOMNode(node) for node in self.node.childNodes]

    def getElementsByTagName(self, name):
        result = self.cachedElementsByTagName.get(name)
        if result != None:
            return result
        uncachedResult = self.node.getElementsByTagName(name)
        cachedResult = [ImmutableDOMNode(node) for node in uncachedResult]
        self.cachedElementsByTagName[name] = cachedResult
        return cachedResult

    def getAttribute(self, qName):
        return self.node.getAttribute(qName)

    def toxml(self, encoding=None):
        return self.node.toxml(encoding)

    def toprettyxml(self, indent="", newl="", encoding=None):
        return self.node.toprettyxml(indent, newl, encoding)

    def appendChild(self, node):
        raise Exception("cannot append child to immutable node")

    def removeChild(self, node):
        raise Exception("cannot remove child from immutable node")

    def cloneNode(self, deep):
        raise Exception("clone node not implemented")

    def createElement(self, tagName):
        raise Exception("cannot create element for immutable node")

    def createTextNode(self, tagName):
        raise Exception("cannot create text node for immutable node")

    def createAttribute(self, qName):
        raise Exception("cannot create attribute for immutable node")
Michel Müller
  • 5,535
  • 3
  • 31
  • 49

1 Answers1

0

So I decided to just make my own DOM implementation that meets my requirements, I've pasted it below in case it helps someone. It depends on lru_cache from memoization library for python 2.7 and @Raymond Hettinger's immutable dict from Immutable dictionary, only use as a key for another dictionary. However, these dependencies are easy to remove if you don't mind less safety/performance.

class CycleFreeDOMNode(object):
    def __init__(self, minidomNode=None):
        if minidomNode is None:
            return
        if not isinstance(minidomNode, xml.dom.minidom.Node):
            raise ValueError("%s needs to be instantiated with a minidom.Node" %(
                type(self).__name__
            ))
        if minidomNode.nodeValue and minidomNode.childNodes:
            raise ValueError(
                "both nodeValue and childNodes in same node are not supported"
            )
        self._tagName = minidomNode.tagName \
            if hasattr(minidomNode, "tagName") else None
        self._nodeType = minidomNode.nodeType
        self._nodeName = minidomNode.nodeName
        self._nodeValue = minidomNode.nodeValue
        self._attributes = dict(
            item
            for item in minidomNode.attributes.items()
        ) if minidomNode.attributes else {}
        self._childNodes = tuple(
            CycleFreeDOMNode(cn)
            for cn in minidomNode.childNodes
        )
        childNodesByTagName = defaultdict(list)
        for cn in self._childNodes:
            childNodesByTagName[cn.tagName].append(cn)
        self._childNodesByTagName = ImmutableDict(childNodesByTagName)

    @property
    def nodeType(self):
        return self._nodeType

    @property
    def tagName(self):
        return self._tagName

    @property
    def nodeName(self):
        return self._nodeName

    @property
    def nodeValue(self):
        return self._nodeValue

    @property
    def attributes(self):
        return self._attributes

    @property
    def firstChild(self):
        return self._childNodes[0] if self._childNodes else None

    @property
    def childNodes(self):
        return self._childNodes

    @lru_cache(maxsize = 100)
    def getElementsByTagName(self, name):
        result = self._childNodesByTagName.get(name, [])
        for cn in self.childNodes:
            result += cn.getElementsByTagName(name)
        return result

    def cloneNode(self, deep=False):
        clone = CycleFreeDOMNode()
        clone._tagName = self._tagName
        clone._nodeType = self._nodeType
        clone._nodeName = self._nodeName
        clone._nodeValue = self._nodeValue
        clone._attributes = copy.copy(self._attributes)
        if deep:
            clone._childNodes = tuple(
                cn.cloneNode(deep)
                for cn in self.childNodes
            )
            childNodesByTagName = defaultdict(list)
            for cn in clone._childNodes:
                childNodesByTagName[cn.tagName].append(cn)
            clone._childNodesByTagName = ImmutableDict(childNodesByTagName)
        else:
            clone._childNodes = tuple(cn for cn in self.childNodes)
            clone._childNodesByTagName = self._childNodesByTagName
        return clone

    def toxml(self):
        def makeXMLForContent():
            return self.nodeValue or "".join([
                cn.toxml() for cn in self.childNodes
            ])

        if not self.tagName:
            return makeXMLForContent()
        return "<%s%s>%s</%s>" %(
            self.tagName,
            " " + ", ".join([
                "%s=\"%s\"" %(k,v)
                for k,v in self.attributes.items()
            ]) if any(self.attributes) else "",
            makeXMLForContent(),
            self.tagName
        )

    def getAttribute(self, name):
        return self._attributes.get(name, "")

    def setAttribute(self, name, value):
        self._attributes[name] = value
Community
  • 1
  • 1
Michel Müller
  • 5,535
  • 3
  • 31
  • 49