Python XML Compatible String

Question

I am writing an XML file using lxml and am having issues with control characters. I am reading text that contains control characters from a file and assigning it to an element. When I run the script, I receive this error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

So I wrote a small function to replace the control characters with a '?'. When I looked at the generated XML, it appeared that the control characters were newlines (0x0A). With this knowledge, I wrote a function to encode these control characters:

def encodeXMLText(text):
    text = text.replace("&",  "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'",  "&apos;")
    text = text.replace("<",  "&lt;")
    text = text.replace(">",  "&gt;")
    text = text.replace("\n", "&#xA;")
    text = text.replace("\r", "&#xD;")
    return text

This still raises the same error as before. I want to preserve the newlines, so simply stripping them isn't a valid option for me. I have no idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:

  ruleTitle = ET.SubElement(rule,'title')
  ruleTitle.text = encodeXMLText(titleText)

The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.

Possible duplicate of [Python: Escaping strings for use in XML](http://stackoverflow.com/questions/1546717/python-escaping-strings-for-use-in-xml) — DeepSpace, Jan 27 '17 at 18:51
This answer seemed to work for me https://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python — slobbity, May 26 '21 at 14:44

score 0 · Accepted Answer · answered Jan 28 '17 at 17:08

I printed out the string to see which specific characters were causing the issue and noticed these bytes in the text: \xe2\x80\x99 (the UTF-8 encoding of U+2019, the right single quotation mark). So the issue was the encoding, not the newlines. Changing the code to look like this fixed my issue:

ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
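
Note that `.decode('UTF-8')` as written above only works when `titleText` is a byte string (a Python 2 `str`, or `bytes` in Python 3). A Python 3 `str` has no `.decode` method. For anyone hitting the same error on Python 3, here is a minimal sketch of the same idea. The helper name `to_xml_text` and the sample data are my own and purely illustrative. It decodes bytes as UTF-8 (an assumption about the input file) and also drops the control characters that XML 1.0 forbids, while keeping tab, \n and \r. The standard library's `xml.etree.ElementTree` is used as a stand-in so the snippet runs without lxml installed. `lxml.etree` provides the same `Element`, `SubElement` and `tostring` calls used here.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Characters XML 1.0 does not allow in text: the C0 controls other than
# tab (0x09), LF (0x0A) and CR (0x0D), plus U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def to_xml_text(raw, encoding="utf-8"):
    """Hypothetical helper: return text that can be assigned to element.text."""
    if isinstance(raw, bytes):
        # Same fix as the answer above: turn bytes into text first.
        raw = raw.decode(encoding)
    # Remove forbidden control characters; newlines are left untouched.
    return _XML_ILLEGAL.sub("", raw)


# Illustrative data: UTF-8 bytes containing U+2019, a newline, and a stray
# vertical tab (0x0B) that XML does not allow.
title_bytes = b"Don\xe2\x80\x99t\nstop\x0b"

rule = ET.Element("rule")
rule_title = ET.SubElement(rule, "title")
rule_title.text = to_xml_text(title_bytes)

print(ET.tostring(rule, encoding="unicode"))
```

There is also no need to call `encodeXMLText` from the question. The serializer escapes `&`, `<` and `>` in text itself, so escaping them by hand first would produce double-escaped output such as `&amp;amp;`.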
