23

I am trying to parse an XML file into an ElementTree:

>>> import xml.etree.cElementTree as ET
>>> tree = ET.ElementTree(file='D:\Temp\Slikvideo\JPEG\SV_4_1_mask\index.xml')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Anaconda2\lib\xml\etree\ElementTree.py", line 611, in __init__
    self.parse(file)
  File "<string>", line 38, in parse
ParseError: junk after document element: line 3, column 0

The XML file starts like this:

<?xml version="1.0" encoding="UTF-8" ?>
<Version Writer="E:\d\src\Modules\SceneSerialization\src\mitkSceneIO.cpp" Revision="$Revision: 17055 $" FileVersion="1" />
<node UID="OBJECT_2016080819041580480127">
    <source UID="OBJECT_2016080819041550469454" />
    <data type="LabelSetImage" file="hfbaaa_Bolus.nrrd" />
    <properties file="sicaaa" />
</node>
<node UID="OBJECT_2016080819041512769572">
    <source UID="OBJECT_2016080819041598947781" />
    <data type="LabelSetImage" file="ifbaaa_Bolus.nrrd" />
    <properties file="ticaaa" />
</node>

followed by many more nodes.

I do not see any junk at line 3, column 0, so I assume there must be another reason for the error.

The .xml file is generated by external software (MITK), so I assumed it would be fine.

I am working on Windows 7 (64-bit) with VS2015 and Anaconda.

Martin Valgur
jdelange
  • That XML isn't well-formed. There is no root element that contains all other elements. – Ian McLaird Aug 09 '16 at 14:36
  • Unrelated to the question, you should consider either escaping the Windows path string literal ("...\\...") or use raw strings (r"...\..."). – Martin Valgur Aug 09 '16 at 14:39
  • @Martin: thanks, agree. Done that in other parts of the code. – jdelange Aug 09 '16 at 14:41
  • In my case, the simple solution was embedding the tree caller in a `try: ... / except: pass` block, for anyone who simply does not care about one out of 100s of files. :)) – questionto42 Nov 20 '20 at 15:16
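A narrower version of the skip-on-failure idea from the last comment might look like the sketch below. It catches only `ET.ParseError` instead of using a bare `except`, so unrelated bugs are not silently swallowed. The helper name `parse_or_none` and the two-element sample string are illustrative assumptions, not part of the original post.

```python
import xml.etree.ElementTree as ET

def parse_or_none(xml_text):
    """Return the root Element, or None if xml_text is not well-formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Report the problem and let the caller skip this document.
        print("skipping malformed XML: %s" % err)
        return None

# Two top-level elements, like the file in the question (made-up sample).
broken = '<Version FileVersion="1" />\n<node UID="a" />'
print(parse_or_none(broken))          # None, after the parse error is reported
print(parse_or_none("<root/>").tag)   # root
```

This keeps a batch job running over hundreds of files while still printing which documents were rejected and why.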

3 Answers

35

As @Matthias Wiehl said, ElementTree expects a single root node, and this document has several top-level elements, so it is not well-formed XML. That should be fixed at its origin. As a workaround, you can add a fake root node to the document:

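import xml.etree.cElementTree as ET
import re

# Read the whole document as text.
with open("index.xml") as f:
    xml = f.read()
# Insert <root> right after the XML declaration and append </root>,
# so that all top-level elements share a single parent.
# Note: fromstring() returns the root Element, not an ElementTree.
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")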
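A regex-free variant of the same workaround could be sketched as follows. It assumes the document is already available as text, drops the XML declaration (so decoding is left to whoever read the file), and wraps everything else in a synthetic root. The function name `parse_fragments`, the `root_tag` parameter, and the shortened sample are hypothetical; the sketch uses `xml.etree.ElementTree` rather than `cElementTree`.

```python
import xml.etree.ElementTree as ET

def parse_fragments(xml_text, root_tag="root"):
    """Wrap multi-root XML in a synthetic root element and parse it."""
    body = xml_text.lstrip()
    if body.startswith("<?xml"):
        # Drop the declaration; it is only legal at the very start.
        body = body[body.index("?>") + 2:]
    return ET.fromstring("<%s>%s</%s>" % (root_tag, body, root_tag))

# Shortened, made-up sample with the same shape as the file in the question.
sample = """<?xml version="1.0" encoding="UTF-8" ?>
<Version FileVersion="1" />
<node UID="OBJECT_1" />
<node UID="OBJECT_2" />
"""
root = parse_fragments(sample)
print([child.tag for child in root])  # ['Version', 'node', 'node']
```

The synthetic root exists only in memory, so the file written by the external tool is left untouched.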
Martin Valgur
  • Martin, that's an elegant fix. It works when importing etree.ElementTree; with cElementTree I get an error in cElementTree.py: un(shallow)copyable object of type . I need to figure out why. – jdelange Aug 09 '16 at 15:29
3

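The root node of your document (Version) is opened and closed on line 2, and the parser does not expect any nodes after the root node. One solution is to remove the closing forward slash from the Version tag and add a matching </Version> at the end of the document, so that Version encloses all the node elements.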
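If the file cannot be regenerated, that repair could be applied in memory before parsing. The following is only a sketch under the assumption that the document contains exactly one self-closing `<Version ... />` element near the top; the helper name `adopt_version_as_root` and the sample text are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def adopt_version_as_root(xml_text):
    """Open the self-closing <Version ... /> and close it at the end of the text."""
    # Rewrite only the first '<Version ... />' into '<Version ...>'.
    opened, count = re.subn(r"(<Version\b[^>]*?)\s*/>", r"\1>", xml_text, count=1)
    if count != 1:
        raise ValueError("no self-closing <Version /> element found")
    return opened + "</Version>"

# Shortened, made-up sample with the same shape as the file in the question.
sample = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    '<Version FileVersion="1" />\n'
    '<node UID="OBJECT_1" />\n'
    '<node UID="OBJECT_2" />\n'
)
root = ET.fromstring(adopt_version_as_root(sample))
print(root.tag, len(root))  # Version 2
```

Raising `ValueError` when the pattern is missing avoids appending a stray `</Version>` to a document that has a different layout.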

Matthias Wiehl
  • Assuming I need to parse this file (I cannot generate a different format), what would be a quick fix? Copy the file and create a dummy that is properly formatted and then parse that? What should I change? Should I put the closing forward slash at the end of the document? – jdelange Aug 09 '16 at 14:40
  • As was pointed out correctly, the document is not well-formed. The software that generated it is broken. You should file a bug report. – Matthias Wiehl Aug 09 '16 at 14:46
0

Try repairing the document like this, closing the Version element at the end of the file instead of on line 2:

<?xml version="1.0" encoding="UTF-8" ?>
<Version Writer="E:\d\src\Modules\SceneSerialization\src\mitkSceneIO.cpp" Revision="$Revision: 17055 $" FileVersion="1">
    <node UID="OBJECT_2016080819041580480127">
        <source UID="OBJECT_2016080819041550469454" />
        <data type="LabelSetImage" file="hfbaaa_Bolus.nrrd" />
        <properties file="sicaaa" />
    </node>
    <node UID="OBJECT_2016080819041512769572">
        <source UID="OBJECT_2016080819041598947781" />
        <data type="LabelSetImage" file="ifbaaa_Bolus.nrrd" />
        <properties file="ticaaa" />
    </node>
</Version>
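Once the document has this shape, it should parse without the error, and the node entries can be read as ordinary child elements. A minimal sketch follows, using a trimmed copy of the repaired document with the Writer and Revision attributes left out for brevity; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the repaired document: one node, fewer Version attributes.
repaired = """<?xml version="1.0" encoding="UTF-8" ?>
<Version FileVersion="1">
    <node UID="OBJECT_2016080819041580480127">
        <source UID="OBJECT_2016080819041550469454" />
        <data type="LabelSetImage" file="hfbaaa_Bolus.nrrd" />
        <properties file="sicaaa" />
    </node>
</Version>
"""

root = ET.fromstring(repaired)
# Collect (node UID, data file name) pairs from every <node> element.
records = [
    (node.get("UID"), node.find("data").get("file"))
    for node in root.iter("node")
]
print(records)  # [('OBJECT_2016080819041580480127', 'hfbaaa_Bolus.nrrd')]
```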
Raja Sattiraju