I'm converting some text data to XML using xml.etree.ElementTree. I also need pretty-printed XML output, so I use this answer.

But my data contains some unusual characters such as the vertical tab (\v or \x0b). When I convert my XML to a string, the character is not escaped (which I suppose produces invalid XML), and when I then try to reparse the string to pretty-print it, parsing fails.

Here is a code example:

import xml.etree.ElementTree as ET
import xml.dom.minidom as MD

root = ET.Element("root")
root.text = "some <<>> text \v other text"

rough_string = ET.tostring(root, 'utf-8')
reparsed = MD.parseString(rough_string)

Here rough_string contains the following: <root>some &lt;&lt;&gt;&gt; text other text</root>. It escaped the <<>> but left the \v untouched.

Whereas if I do the same in .NET, it does escape it:

XmlDocument doc = new XmlDocument();    
XmlElement priceElement = doc.CreateElement("root");
priceElement.InnerText = "some <<>> text \v other text";
doc.AppendChild(priceElement);  
string res = doc.OuterXml;

Result is <root>some &lt;&lt;&gt;&gt; text &#xB; other text</root>.

Is this a bug in ElementTree? How can I solve this issue?

UPDATE: It seems that the behavior of both ElementTree and .NET is incorrect, as was pointed out in the comments. But how should I handle this? If these characters are really that problematic, I could just remove them from the source string (my requirements here are not very strict), but I need to know the full list of such characters. Where can I find one?

Pavel K
  • https://stackoverflow.com/a/25926644/1030675 – choroba Apr 20 '18 at 09:14
  • I can assume that there might be some illegal chars, but this should be somehow handled? .NET does handle it somehow? How can I handle it in python? – Pavel K Apr 20 '18 at 09:20
  • .NET handles it wrongly, the entity `&#xB;` doesn't exist. – choroba Apr 20 '18 at 09:30
  • Ok, but what is the correct way to handle it? Is the fact that this symbol is included in the output of ElementTree also incorrect behavior? – Pavel K Apr 20 '18 at 09:33
  • Whether it's "incorrect" depends on the specification: because such checking is expensive, it's quite legitimate for an API to say "we don't check that the string contains only valid XML characters, this is the caller's responsibility". – Michael Kay Apr 20 '18 at 10:06

1 Answer

If you can't be sure about the text content, you should enclose it in a CDATA section. Unfortunately, a plain \v is not accepted even inside CDATA, so you have two options:

  • Remove it or replace it with space
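  • Encode it as &#xB;, which seems to work at least for a libxml2 utility (see the example at the bottom of this answer). For Python, lxml is based on libxml2 too. Note that inside a CDATA section the reference is not decoded: the parser passes the characters &#xB; through as literal text, so whoever reads the data has to convert them back.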
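For the first option, and for the "full list" asked about in the question, the XML 1.0 specification defines the allowed characters as #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]; everything else is illegal. Below is a minimal sketch of stripping them before building the tree, assuming Python 3. The helper name strip_illegal_xml_chars is made up for illustration and is not part of ElementTree.

```python
import re
import xml.dom.minidom as MD
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text, replacement=""):
    """Hypothetical helper: drop (or replace) characters XML 1.0 forbids."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


root = ET.Element("root")
# Replace the vertical tab with a space before it ever reaches the tree.
root.text = strip_illegal_xml_chars("some <<>> text \v other text", " ")

rough_string = ET.tostring(root, "utf-8")
# Reparsing now succeeds because no illegal character is left.
pretty = MD.parseString(rough_string).toprettyxml(indent="  ")
print(pretty)
```

Pass a different replacement (or none) depending on whether the position of the removed character matters for your data.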

    echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><a><\041[CDATA[asdf\ndf \v nbbv]]></a></root>" | xmllint --format -
    

Result with error:

-:2: parser error : CData section not finished
asdf
d
df 
    nbbv]]></a></root>
^
-:2: parser error : PCDATA invalid Char value 11
df 
    nbbv]]></a></root>
^
-:2: parser error : Sequence ']]>' not allowed in content
df 
    nbbv]]></a></root>
    ^
-:2: parser error : Sequence ']]>' not allowed in content
df 
    nbbv]]></a></root>
        ^
-:2: parser error : internal error: detected an error in element content

df 
    nbbv]]></a></root>
        ^

Replacing it with the numeric character reference &#xB;:

echo -e "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><a><\041[CDATA[asdf\ndf &#xB; nbbv]]></a></root>" | xmllint --format -
<?xml version="1.0" encoding="UTF-8"?>
<root>
<a><![CDATA[asdf
df &#xB; nbbv]]></a>
</root>
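
The same check can be done from Python with the standard library parser. This is only a sketch of what I would expect, assuming Python 3 and xml.etree.ElementTree (which uses expat, not libxml2): the reference inside CDATA comes back as literal text rather than as a vertical tab, and the same reference outside CDATA is rejected.

```python
import xml.etree.ElementTree as ET

# Inside CDATA nothing is decoded, so the reference is ordinary text.
doc = "<root><a><![CDATA[asdf\ndf &#xB; nbbv]]></a></root>"
text = ET.fromstring(doc).find("a").text
print(repr(text))  # still contains the characters "&#xB;", not "\v"

# Outside CDATA the reference points at a character XML 1.0 forbids,
# so the parser refuses the document.
try:
    ET.fromstring("<root>df &#xB; nbbv</root>")
    reference_accepted = True
except ET.ParseError as exc:
    reference_accepted = False
    print("rejected:", exc)
```

So the CDATA variant only moves the problem: the consumer receives &#xB; as text and has to know to turn it back into \v.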
LMC