Subclassing ElementTree parser to retain comments

Question

I am trying to use ElementTree to parse XML files. Since the parser does not retain comments by default, I used the following code from http://bugs.python.org/issue8277:

import xml.etree.ElementTree as etree

class CommentedTreeBuilder(etree.TreeBuilder):
    """A TreeBuilder subclass that retains comments."""

    def comment(self, data):
        self.start(etree.Comment, {})
        self.data(data)
        self.end(etree.Comment)

parser = etree.XMLParser(target = CommentedTreeBuilder())

The above is in documents.py. Tested with:

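import os
import sys
import unittest
import xml.etree.ElementTree as etree

import documents  # the module shown above

class TestDocument(unittest.TestCase):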

    def setUp(self):
        filename = os.path.join(sys.path[0], "data", "facilities.xml")
        self.doc = etree.parse(filename, parser = documents.parser)

    def testClass(self):
        print("Class is {0}.".format(self.doc.__class__.__name__))
        #commented out tests.

if __name__ == '__main__':
    unittest.main()

This barfs with:

Traceback (most recent call last):
  File "/home/goncalo/documents/games/ja2/modding/mods/xml-overhaul/src/scripts/../tests/test_documents.py", line 24, in setUp
    self.doc = etree.parse(filename, parser = documents.parser)
  File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1242, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.3/xml/etree/ElementTree.py", line 1726, in parse
    parser.feed(data)
IndexError: pop from empty stack

What am I doing wrong? By the way, the xml in the file is valid (as checked by an independent program) and in utf-8 encoding.

Notes:

Using Python 3.3 on Kubuntu 13.04, in case it is relevant. I make sure to run the test scripts with "python3" (and not just "python").

edit: here is the sample xml file used; it is very small (let's see if I can get the formatting right):

<?xml version="1.0" encoding="utf-8"?>
<!-- changes to facilities.xml by G. Rodrigues: ar overhaul.-->
<SECTORFACILITIES>
    <!-- Drassen -->
    <!-- Small airport -->
    <FACILITY>
        <SectorGrid>B13</SectorGrid>
        <FacilityType>4</FacilityType>
        <ubHidden>0</ubHidden>
    </FACILITY>
</SECTORFACILITIES>

I think this is a pretty good question. What could make this a *great* question is if you could provide a very short example xml which causes this to fail. — mgilson, Dec 13 '13 at 19:50
Did you try using Amaury's code from the ticket? His uses the newer ElementTree from 2.7 / 3.x, as opposed to the one you tried. — Lukas Graf, Dec 13 '13 at 19:51
@LukasGraf -- OP is using python3.3. Can you explain how the ElementTree from 2.7 / 3.x is *newer*? I'm not sure that I follow :) — mgilson, Dec 13 '13 at 19:52
Newer than the version Patrick Westerhoff apparently used, see the ticket he linked. Amaury provided some code for the new ET, didn't work for Patrick because I assume he used an older version. OP tried Patrick's code even though he's on 3.3. — Lukas Graf, Dec 13 '13 at 19:55
@mgilson Ok, forget what I said, I totally mixed up the code and the respective authors, sorry for the noise. — Lukas Graf, Dec 13 '13 at 19:59
@G.Rodrigues yeah, I just realized, sorry. The code and the test case you provided pass for me though on 2.7 and 3.3, using the sample XML from the first message in the ticket. — Lukas Graf, Dec 13 '13 at 20:03
The example XML you added works for me on 2.7, but breaks on 3.3. — Lukas Graf, Dec 13 '13 at 20:18
The problem seems to be the first comment - after the XML declaration, before the first element. It isn't part of the tree in 2.7 (doesn't raise an Exception though), and causes the exception in 3.3. — Lukas Graf, Dec 13 '13 at 20:23
@Lukas Graf: thanks. Just tested, by deleting the first comment and indeed no error occurs. — G. Rodrigues, Dec 13 '13 at 21:01

score 2 · Accepted Answer · answered Dec 13 '13 at 21:06

The example XML you added works for me in 2.7, but breaks on 3.3 with the stack trace you described.

The problem seems to be the very first comment - after the XML declaration, before the first element. It isn't part of the tree in 2.7 (doesn't raise an Exception though), and causes the exception in 3.3.

See Python issue #17901. In Python 3.4, which contains the fix from that issue, the "pop from empty stack" IndexError no longer occurs; instead, "ParseError: multiple elements on top level" is raised.

Which makes sense: If you want to retain the comments in the tree, they need to be treated as nodes. XML allows only one element at the top level of the document, and this builder turns every comment into an element, so you can't have a comment before the first "real" element (if you force the parser to retain comments).
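To see the effect in isolation, here is a small sketch that reuses the builder from the question and parses two tiny documents, one with the comment inside the root and one with it before the root. The helper name try_parse and the two sample strings are made up for illustration. The exception type is an assumption that depends on the Python version (IndexError on 3.3 as in the question, ParseError on versions with the fix), so the sketch catches both.

```python
import xml.etree.ElementTree as etree


class CommentedTreeBuilder(etree.TreeBuilder):
    """The builder from the question: every comment becomes an element."""

    def comment(self, data):
        self.start(etree.Comment, {})
        self.data(data)
        self.end(etree.Comment)


def try_parse(xml_bytes):
    """Hypothetical helper: return the root element, or the exception raised."""
    parser = etree.XMLParser(target=CommentedTreeBuilder())
    try:
        return etree.fromstring(xml_bytes, parser=parser)
    except (IndexError, etree.ParseError) as exc:
        return exc


# Comment inside the root element: it simply becomes a child of <root>.
inside = try_parse(b"<root><!-- inside --><child/></root>")

# Comment before the root element: it is pushed as a top-level element,
# so the real root that follows cannot be added to the tree.
before = try_parse(b"<!-- before --><root><child/></root>")

print(type(inside).__name__, type(before).__name__)
```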

So unfortunately I think that's your only option: Remove those comments outside the root document node from your XML files - either in the original files, or by stripping them before parsing.
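If editing the files is not practical, one way to get the same "stripping" effect at parse time is to have the builder ignore any comment it sees while no element is open. The following is only a sketch under that assumption: the class name InRootCommentTreeBuilder and the _open_elements counter are made up here and are not part of ElementTree or of the ticket.

```python
import xml.etree.ElementTree as etree


class InRootCommentTreeBuilder(etree.TreeBuilder):
    """Retain comments inside the root element, drop those outside it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical counter: number of elements currently open.
        self._open_elements = 0

    def start(self, tag, attrs):
        self._open_elements += 1
        return super().start(tag, attrs)

    def end(self, tag):
        self._open_elements -= 1
        return super().end(tag)

    def comment(self, data):
        if self._open_elements == 0:
            # Before or after the root element: there is no parent to
            # attach the comment to, so skip it.
            return
        self.start(etree.Comment, {})
        self.data(data)
        self.end(etree.Comment)


# A shortened version of the sample file from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<!-- changes to facilities.xml by G. Rodrigues: ar overhaul.-->
<SECTORFACILITIES>
    <!-- Drassen -->
    <!-- Small airport -->
    <FACILITY>
        <SectorGrid>B13</SectorGrid>
    </FACILITY>
</SECTORFACILITIES>
"""

parser = etree.XMLParser(target=InRootCommentTreeBuilder())
root = etree.fromstring(SAMPLE, parser=parser)
comments = [el.text for el in root if el.tag is etree.Comment]
print(root.tag, comments)
```

The leading comment is lost with this approach, so it does not help if that comment has to survive a round trip. For readers on Python 3.8 or later, TreeBuilder also accepts insert_comments=True; it is worth checking how your version treats comments outside the root before relying on it.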
