
I am writing an XML file using lxml and am having issues with control characters. I am reading text that contains control characters from a file and assigning it to an element. When I run the script I receive this error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

So I wrote a small function to replace the control characters with a '?'. When I look at the generated XML, it appears that the control characters are newlines (0x0A). With this knowledge I wrote a function to encode these control characters:

def encodeXMLText(text):
    text = text.replace("&",  "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'",  "&apos;")
    text = text.replace("<",  "&lt;")
    text = text.replace(">",  "&gt;")
    text = text.replace("\n", "&#xA;")
    text = text.replace("\r", "&#xD;")
    return text

This still raises the same error as before. I want to preserve the newlines, so simply stripping them isn't a valid option for me. I have no idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:

  ruleTitle = ET.SubElement(rule,'title')
  ruleTitle.text = encodeXMLText(titleText)

The other questions I have read either don't use lxml or don't address line feed (\n) and carriage return (\r) characters as control characters.
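For reference, here is a minimal sketch of what normally happens when text is assigned directly to an element. The serializer escapes `&`, `<` and `>` on its own, and line feeds are legal XML characters, so manual entity encoding should not be needed. The sample string is made up for illustration, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption added only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:
    # Assumption: fall back to the stdlib module, which offers the same
    # Element / SubElement / tostring / fromstring calls used below.
    import xml.etree.ElementTree as ET

# Illustrative text with markup characters and a line feed.
title_text = "Tom & Jerry <draft>\nsecond line"

rule = ET.Element("rule")
rule_title = ET.SubElement(rule, "title")
rule_title.text = title_text  # assigned as-is, no manual escaping

# encoding="unicode" asks for a text string rather than bytes.
xml_string = ET.tostring(rule, encoding="unicode")
print(xml_string)

# Parsing the result back should return the original text, newline included.
round_tripped = ET.fromstring(xml_string).find("title").text
```

If the text were passed through `encodeXMLText` first, the serializer would escape the ampersands a second time, so `<` would end up as `&amp;lt;` in the output.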

Bellerofont
Joel Parker
  • Possible duplicate of [Python: Escaping strings for use in XML](http://stackoverflow.com/questions/1546717/python-escaping-strings-for-use-in-xml) – DeepSpace Jan 27 '17 at 18:51
  • This answer seemed to work for me https://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python – slobbity May 26 '21 at 14:44
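The filtering approach mentioned in the comment above could be sketched roughly as follows. XML 1.0 allows tab, line feed and carriage return but no other characters below 0x20, so this pattern removes only the disallowed ones and leaves newlines intact. The function name `strip_invalid_xml_chars` and its `replacement` parameter are hypothetical, and whether lxml's own check matches this range exactly is an assumption.

```python
import re

# Everything outside the XML 1.0 "Char" production. Tab (U+0009), line feed
# (U+000A) and carriage return (U+000D) are allowed, so they are not matched.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=""):
    """Replace characters that XML 1.0 does not allow (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# NUL and vertical tab are dropped; the line feed and tab are kept.
cleaned = strip_invalid_xml_chars("a\x00b\x0bc\nd\te")
```

This only helps when the input really contains disallowed control characters. It does not address text that is still an undecoded byte string.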

1 Answer


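I printed out the string to see which specific characters were causing the issue and noticed the bytes \xe2\x80\x99 in the text. That sequence is the UTF-8 encoding of U+2019 (the right single quotation mark), so the problem was the encoding rather than the newlines: titleText was an undecoded byte string. Changing the code to decode it first fixed my issue: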

ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
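A Python 3 flavoured sketch of the same idea, assuming the file content is UTF-8: decode byte strings before assigning them, and pass text strings through untouched. The helper name `to_text` and the sample bytes are hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Illustrative input: the three bytes \xe2\x80\x99 are a UTF-8 encoded
# right single quotation mark (U+2019), followed later by a line feed.
raw_title = b"Don\xe2\x80\x99t panic\nsecond line"
title_text = to_text(raw_title)
```

The decoded `title_text` keeps its newline and can then be assigned to `ruleTitle.text`. In Python 3, opening the file with `open(path, encoding="utf-8")` returns text strings directly, which makes the explicit decode unnecessary.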
Joel Parker