I'm converting some text data to XML and use xml.etree.ElementTree to do so. I also need pretty-printed XML output, so I use this answer.
But my data contains some strange symbols like a vertical tab (\v or \x0b). When I convert my XML to a string, the vertical tab is not escaped (which I suppose produces invalid XML), and then reparsing the string to pretty-print it fails.
Here is a code example:
import xml.etree.ElementTree as ET
import xml.dom.minidom as MD
root = ET.Element("root")
root.text = "some <<>> text \v other text"
rough_string = ET.tostring(root, 'utf-8')
reparsed = MD.parseString(rough_string)
Here rough_string contains the following: <root>some &lt;&lt;&gt;&gt; text \x0b other text</root> (the vertical tab is still a raw character). It did escape the <<>> but missed the \v.
If I do the same in .NET, however, it does escape it:
XmlDocument doc = new XmlDocument();
XmlElement priceElement = doc.CreateElement("root");
priceElement.InnerText = "some <<>> text \v other text";
doc.AppendChild(priceElement);
string res = doc.OuterXml;
The result is <root>some &lt;&lt;&gt;&gt; text &#xB; other text</root>.
Is this a bug in ElementTree? How can I solve this issue?
UPDATE: It seems that the behavior of both ElementTree and .NET is incorrect, as was pointed out in the comments. But how should I handle this? If these are some really tricky characters, I could just remove them from the source string (I do not have very strict requirements for this), but I need to know the full list of such characters. Where could I find one?
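For reference, this is a minimal sketch of the "just remove them" approach I have in mind. It assumes the list of disallowed characters is the complement of the Char production in the XML 1.0 specification (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]); XML 1.1 has different rules. The names _ILLEGAL_XML_CHARS and strip_illegal_xml_chars are my own and not part of any library.

```python
import re
import xml.etree.ElementTree as ET
import xml.dom.minidom as MD

# Assumption: everything outside the XML 1.0 "Char" production is illegal:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# So the pattern matches C0 controls except tab/LF/CR, surrogates, U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text):
    """Return text with characters that XML 1.0 does not allow removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


root = ET.Element("root")
# Clean the text before it goes into the tree, so tostring() never sees \v.
root.text = strip_illegal_xml_chars("some <<>> text \v other text")
rough_string = ET.tostring(root, "utf-8")

# With the vertical tab gone, minidom can reparse and pretty-print the result.
reparsed = MD.parseString(rough_string)
pretty = reparsed.toprettyxml(indent="  ")
print(pretty)
```

This drops the characters silently; replacing them with a space or a visible placeholder would work the same way by changing the second argument of sub().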