ElementTree fails with unsafe characters in strings

Question

I am trying to parse an xml which contains regexes like so:

<conditions>
  <condition pattern_matches="regex string"/>
</conditions>

However, when regex contains an unsafe character, for instance (<=a).*b$, ElementTree raises a ParseError saying that xml is not well formed at < character, even though the character is inside quotes.

I can of use < instead of < and then once parsed replace all such characters, but that renders complex regexes very difficult to read and requires rewriting regexes which contain such character combinations so as not to create false-positives, and loading raw xml file and then swapping characters for their safe variations, just to swap them back immediately afterwards seems unneccessarily cpu intensive.

How should I approach this problem? Is this too complicated for ElementTree or am I doing something wrong?

score 1 · Accepted Answer · edited May 23 '17 at 11:44

It is requirement from XML specification that < has to be escaped to <. Every sane XML processor must be following the specification. See related discussion: Invalid Characters in XML .

That said, if you create the XML using an XML processor such as ElementTree, it would have handled the escaping and unescaping process for you. For example, given plain regex string which contains <, ElementTree automatically replaces it with < :

>>> from xml.etree import ElementTree as et
>>> root = et.Element("conditions")
>>> regex_str = "(<=a).*b$"
>>> sub = et.SubElement(root, "condition", attrib = {"pattern_matches": regex_str})
>>> et.tostring(root)
'<conditions><condition pattern_matches="(&lt;=a).*b$" /></conditions>'

... and it will automatically replace it back to < upon reading the attribute value :

>>> sub.attrib["pattern_matches"]
'(<=a).*b$'

Hmm, I assumed that I don't need to escape non-quotation characters in attribute value strings... Thanks. — Mirac7, May 30 '16 at 13:44

ElementTree fails with unsafe characters in strings

1 Answers1