Python ElementTree does not like colon in name of processing instruction

Question

The following code:

import xml.etree.ElementTree as ET

xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

root = ET.fromstring(xml)

xml2 = xml.replace('LazyComment ', 'LazyComment:')
print(xml2)
try:
    root2 = ET.fromstring(xml2)
except ET.ParseError:
    print("\nERROR in xml2!!!\n")

xml3 = xml2.replace('testCaseConfig', 'testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/"', 1)
print(xml3)
try:
    root3 = ET.fromstring(xml3)
except ET.ParseError:
    print("\nERROR in xml3!!!\n")
    raise

Gives this output:

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml2!!!

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>

ERROR in xml3!!!

Traceback (most recent call last):
  File "C:\Users\Paddy3118\Google Drive\Code\elementtree_error.py", line 30, in <module>
    root3 = ET.fromstring(xml3)
  File "C:\Anaconda3\envs\Py3.5\lib\xml\etree\ElementTree.py", line 1333, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 17

I searched and found a related question that pointed to other resources, which I read.

It seems that the '?' makes it a processing instruction, whose target name can include colons. Without the '?', a colon in a name indicates a namespace, and one of the answers stated that defining the namespace should make things work.

Combining '?' and ':', though, causes issues with ElementTree.

I am given XML files of this type. Other tools parse them without complaint, and I want to process the files myself using Python. Any ideas?

Thanks.

Parfait · Accepted Answer · 2016-07-15T02:56:51.863

According to the W3C Extensible Markup Language 1.0 Specifications under Common Syntactic Constructs:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

And further in the W3C XPath 1.0 note on Processing Instruction nodes:

A processing instruction has an expanded-name: the local part is the processing instruction's target; the namespace URI is null.

Altogether, <?LazyComment:Blah de blah/?> is an invalid processing instruction: colons are used to reference namespace URIs, and for processing instructions that part is null or empty. The Namespaces in XML Recommendation likewise requires that processing instruction targets in a namespace-well-formed document contain no colons. Python's XML processor therefore complains that a document containing such an instruction is not well-formed.
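
To see exactly where ElementTree gives up, the `position` attribute of `ET.ParseError` can be inspected. The following is a minimal sketch; the sample document and the helper name `parse_error_position` are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the colon sits in the processing instruction target.
sample = '''\
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
</testCaseConfig>'''


def parse_error_position(text):
    """Return the (line, column) of the first parse error, or None if text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


bad_position = parse_error_position(sample)
good_position = parse_error_position(sample.replace('LazyComment:', 'LazyComment '))
print(bad_position)   # points at the line holding the processing instruction
print(good_position)  # None: the variant with a space parses cleanly
```

The reported line is the one containing the instruction, which matches the "line 3, column 17" in the traceback from the question.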

Also, reconsider the tools that generate such invalid processing instructions, as they are not handling valid XML documents. Possibly they treat XML files as plain text documents, much as you were able to insert the colon by string replacement but would not have been able to append such an instruction using etree.
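
Whether a parser accepts the colon appears to depend on whether namespace processing is switched on, which may be why some other tools read these files. As a rough sketch, assuming the expat build that ships with CPython: a parser created with `xml.parsers.expat.ParserCreate()` and no `namespace_separator` does no namespace processing and reports the instruction, whereas ElementTree creates its expat parser with namespace processing enabled. The sample document and the helper name `collect_pis` are hypothetical.

```python
import xml.parsers.expat

# Hypothetical sample modelled on the document in the question.
sample_doc = '''\
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
</testCaseConfig>'''


def collect_pis(text):
    """Parse text without namespace processing; return (target, data) pairs."""
    found = []
    # No namespace_separator argument, so expat runs in non-namespace mode.
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, data: found.append((target, data))
    )
    parser.Parse(text, True)
    return found


pis = collect_pis(sample_doc)
print(pis)
```

This only shows that the raw expat bindings can read the file; it does not make the document namespace-well-formed, so fixing the source files is still the cleaner route.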

Ah! I had missed the XPath note when searching W3C, thanks. The files are hand-edited but are read by some C++ XML parsing library that seems to accept this error. I have already added an XML pre-filter to my script that replaces this occurrence with a space, and all is well. I just wanted to know whether ElementTree was rejecting valid XML, which would mean alerting its maintainers. Thanks again. – Paddy3118, Jul 15 '16 at 20:44
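
The pre-filter mentioned in the comment above could look something like this. It is only a sketch: the regular expression and the names `PI_COLON` and `prefilter` are assumptions, not the asker's actual code. It swaps the colon in a processing instruction target for a space before handing the text to ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: "<?" followed by a simple ASCII name and then a colon.
PI_COLON = re.compile(r'(<\?[A-Za-z_][\w.-]*):')


def prefilter(text):
    """Replace the colon in a processing instruction target with a space."""
    return PI_COLON.sub(r'\1 ', text)


raw = '''\
<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig>
    <?LazyComment:Blah de blah/?>
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig>'''

filtered = prefilter(raw)
root = ET.fromstring(filtered)
names = [tc.get('name') for tc in root.iter('testCase')]
print(names)
```

The XML declaration is left untouched because its target is not followed directly by a colon. Names with non-ASCII characters would need a broader pattern.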

score 0 · Answer 2 · answered Jul 13 '16 at 11:12

<?xml version="1.0" encoding="UTF-8"?>
<testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">
    <?LazyComment:Blah de blah/?>   
    <testCase runLimit="420" name="d1/n1"/>
    <testCase runLimit="420" name="d1/n2"/>
</testCaseConfig xmlns:Blah="http://www.w3.org/TR/html4/">

The XML above is invalid. You can't have attributes in a closing tag; the last line should be just </testCaseConfig>

Also, comments are written like this:

<!-- this is a comment -->

mowcow

Thanks for pointing that out. It is not a feature of my problem, so I have edited my question to do the xmlns insertion only once. – Paddy3118 Jul 14 '16 at 22:18
