Why is a vertical tab not escaped by ElementTree in Python 3.5?

Question

I'm converting some text data to XML using xml.etree.ElementTree. I also need pretty-printed XML output, so I use this answer.

However, my data contains some unusual characters such as the vertical tab (\v or \x0b). When I convert my XML to a string, that character is not escaped (which I suppose produces invalid XML), and when I then try to reparse the string to pretty-print it, parsing fails.

Here is a code example:

import xml.etree.ElementTree as ET
import xml.dom.minidom as MD

root = ET.Element("root")
root.text = "some <<>> text \v other text"

rough_string = ET.tostring(root, 'utf-8')
reparsed = MD.parseString(rough_string)

Here rough_string contains the following: b'<root>some &lt;&lt;&gt;&gt; text \x0b other text</root>'. It did escape the <<>> but missed the \v.

Whereas if I do the same in .NET, it does escape it:

XmlDocument doc = new XmlDocument();    
XmlElement priceElement = doc.CreateElement("root");
priceElement.InnerText = "some <<>> text \v other text";
doc.AppendChild(priceElement);  
string res = doc.OuterXml;

Result is <root>some &lt;&lt;&gt;&gt; text &#xB; other text</root>.

Is this a bug in ElementTree? How can I solve this issue?

UPDATE: It seems that the behavior of both ElementTree and .NET is incorrect, as was pointed out in the comments. But how should I handle this? If these are really problematic characters, I could just remove them from the source string (my requirements here are not very strict), but I need to know the full list of such characters. Where can I find one?

I can assume that there might be some illegal chars, but shouldn't this be handled somehow? .NET does handle it somehow. How can I handle it in Python? — Pavel K, Apr 20 '18 at 09:20
OK, but what is the correct way to handle it? Is the fact that this symbol is included in the output of ElementTree also incorrect behavior? — Pavel K, Apr 20 '18 at 09:33
Whether it's "incorrect" depends on the specification: because such checking is expensive, it's quite legitimate for an API to say "we don't check that the string contains only valid XML characters, this is the caller's responsibility". — Michael Kay, Apr 20 '18 at 10:06

score 0 · Answer 1 · answered Apr 20 '18 at 18:58

If you can't be sure about the text content, you should enclose it in a CDATA section. Unfortunately, a plain \v is not accepted even inside CDATA, so you have two options:

1. Remove it or replace it with a space.
2. Encode it as &#xB;. This seems to work, at least for a libxml2 utility (see the example at the bottom of this answer). Note that inside CDATA the sequence is kept as literal text rather than decoded back to a vertical tab. For Python, lxml is based on libxml2 too.
```
echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><a><\041[CDATA[asdf\ndf \v nbbv]]></a></root>" | xmllint --format -
```

Result with error:

-:2: parser error : CData section not finished
asdf
d
df 
    nbbv]]></a></root>
^
-:2: parser error : PCDATA invalid Char value 11
df 
    nbbv]]></a></root>
^
-:2: parser error : Sequence ']]>' not allowed in content
df 
    nbbv]]></a></root>
    ^
-:2: parser error : Sequence ']]>' not allowed in content
df 
    nbbv]]></a></root>
        ^
-:2: parser error : internal error: detected an error in element content

df 
    nbbv]]></a></root>
        ^
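
For the first option, a minimal Python sketch of removing or replacing such characters follows. It assumes the XML 1.0 definition of a legal character (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The helper name strip_illegal_xml_chars and the choice of a space as the replacement are illustrative and not part of any library.

```python
import re
import xml.dom.minidom as MD
import xml.etree.ElementTree as ET

# Matches every character that is NOT allowed by the XML 1.0 "Char" rule:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=" "):
    """Hypothetical helper: replace characters that XML 1.0 forbids."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


root = ET.Element("root")
# Clean the text before it goes into the tree; ElementTree does not check it.
root.text = strip_illegal_xml_chars("some <<>> text \v other text")

rough_string = ET.tostring(root, "utf-8")
# Reparsing now succeeds because the vertical tab is gone.
pretty = MD.parseString(rough_string).toprettyxml(indent="  ")
print(pretty)
```

Passing replacement="" drops the characters instead of substituting a space.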

Replacing it with its numeric character reference (&#xB;):

echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><a><\041[CDATA[asdf\ndf &#xB; nbbv]]></a></root>" | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<root>
<a><![CDATA[asdf
df &#xB; nbbv]]></a>
</root>
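
It is worth checking what a parser actually hands back for that reference. The sketch below assumes the expat-based parser bundled with CPython's xml.etree.ElementTree. It suggests that &#xB; inside a CDATA section comes back as five literal characters rather than as a vertical tab, and that the same reference outside CDATA is rejected under XML 1.0.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "&#xB;" is not a character reference, just literal text.
inside = ET.fromstring("<root><a><![CDATA[df &#xB; nbbv]]></a></root>")
cdata_text = inside.find("a").text
print(repr(cdata_text))

# Outside CDATA, the reference names a character that XML 1.0 forbids,
# so the parser is expected to refuse the document.
try:
    ET.fromstring("<root><a>df &#xB; nbbv</a></root>")
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("rejected:", exc)
```

In other words, the reference survives only as an opaque marker, and a consumer would have to decode it again itself if the original character matters.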
