
I've been using minidom to parse XML for years. Now I've suddenly learned about ElementTree. My question: which is better for parsing? That is:

  • Which is faster?
  • Which uses less memory?
  • Does either have any O(n^2) behavior I should worry about?
  • Is one being deprecated in favor of the other?

Why do we have two interfaces?

Thanks.

vy32

2 Answers


The DOM and SAX interfaces are the classic ways to work with XML. Python had to provide them because they are well-known and standard.

The ElementTree package was intended to provide a more Pythonic interface. It is all about making things easier for the programmer.
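To make the "more Pythonic" point concrete, here is a minimal sketch that pulls the same data out of one document with both standard-library APIs. The sample XML and the function names `titles_minidom` and `titles_etree` are made up for illustration; only the `xml.dom.minidom` and `xml.etree.ElementTree` calls come from the standard library.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
XML = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

def titles_minidom(xml_text):
    """Collect book titles through the W3C-style DOM API."""
    doc = minidom.parseString(xml_text)
    titles = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # In the DOM, text is a separate child node of the element.
        titles.append(title.firstChild.data)
    return titles

def titles_etree(xml_text):
    """Collect the same titles through ElementTree."""
    root = ET.fromstring(xml_text)
    # findtext() returns the text of the first matching child.
    return [book.findtext("title") for book in root.iter("book")]

print(titles_minidom(XML))
print(titles_etree(XML))
```

Both functions return the same list; the difference is mostly in how much ceremony it takes to get there.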

Depending on your build, each of those has an underlying C implementation that makes them run fast.

None of the above tools is being deprecated. They each have their merits (SAX doesn't need to read the whole input into memory, for example).
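As a rough sketch of the streaming style, the handler below collects `<title>` text while the parser walks the input, without ever building a tree. The `TitleHandler` class, the `sax_titles` helper, and the element name `title` are assumptions chosen for the example; `xml.sax.parse` and `ContentHandler` are the standard-library pieces.

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one text run in several pieces.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._in_title = False

def sax_titles(stream):
    """Parse a binary stream incrementally and return the collected titles."""
    handler = TitleHandler()
    xml.sax.parse(stream, handler)
    return handler.titles

sample = io.BytesIO(
    b"<library><book><title>Dune</title></book>"
    b"<book><title>Emma</title></book></library>"
)
print(sax_titles(sample))
```

The price of the low memory footprint is that you have to track state yourself (here, the `_in_title` flag).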

There is also a third-party module called lxml, which is a popular choice (full-featured and fast).

Raymond Hettinger
  • 3
    And if you have performance issues with the element, there's lxml which provides a compatible interface but uses a battle-hardened, highly tuned C library behind the scenes. –  Nov 05 '11 at 19:03
  • 1
    ElementTree is "more Pythonic" mainly because you say myNode[3] instead of myNode.childNodes[3] to get the fourth child. It takes 2 lines of code to tweak any DOM implementation so you can do the same. More importantly, ElementTree treats text content vastly differently from nearly every other tool, and makes some common tasks much more difficult. For example, to collect all the text, you have to not only recurse, but grab *2* properties off each node (text at the start of an element is stored differently than text that follows a sub-element!) – TextGeek Jan 01 '20 at 16:14
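The text handling described in the comment above can be sketched as follows. The helper name `collect_text` and the sample markup are hypothetical; the `.text` and `.tail` attributes and `itertext()` are part of the standard `xml.etree.ElementTree` API.

```python
import xml.etree.ElementTree as ET

def collect_text(elem):
    """Recursively gather all text under elem, in document order."""
    # .text holds the text before the first child element.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(collect_text(child))
        # .tail holds the text that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring("<p>Hello <b>big</b> world</p>")
print(repr(root.text))      # text before <b>
print(repr(root[0].tail))   # text after </b>, stored on the <b> element
print(collect_text(root))
print("".join(root.itertext()))  # built-in equivalent of the recursion
```

In practice `itertext()` covers the common "give me all the text" case, so the manual recursion is mainly useful for seeing where each piece is stored.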

Python has two interfaces probably because ElementTree was integrated into the standard library a good deal later than minidom. The reason for adding it was likely its far more "Pythonic" API compared to the W3C-controlled DOM.

If you're concerned about speed, there's also lxml, which builds an ElementTree-compatible DOM using libxml2 and should be quite fast – they have a benchmark suite comparing themselves to ElementTree's Python and C implementations available.

If you're concerned about memory use, you shouldn't be using a tree API anyway; PullDOM might be a better choice, but I'm extrapolating from experience using Java's excellent pull parser – there doesn't seem to be much current information on PullDOM.
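For what it's worth, a minimal sketch of the pull style with the standard library's `xml.dom.pulldom` might look like this. The `pulldom_titles` helper and the `title` element name are assumptions for the example; whether this actually saves memory on your data is something to measure rather than take from this snippet.

```python
import io
from xml.dom import pulldom

def pulldom_titles(stream):
    """Stream through the document, expanding only the <title> nodes."""
    titles = []
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "title":
            # Build a small DOM subtree for just this element.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            titles.append(text)
    return titles

sample = io.BytesIO(
    b"<library><book><title>Dune</title></book>"
    b"<book><title>Emma</title></book></library>"
)
print(pulldom_titles(sample))
```

Only the expanded subtrees are materialized as DOM nodes; the rest of the document is seen as a stream of events.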

millimoose