I have an XML exporter that creates feeds from my database, and I have an escape method so that the XML-sensitive characters in my data do not conflict with the XML markup.
The method looks like this:
def escape(m_str):
    # Escape "&" first so the entities added below are not escaped again.
    m_str = m_str.replace("&", "&amp;")
    m_str = m_str.replace("\n", "<br />")
    m_str = m_str.replace("<", "&lt;")
    m_str = m_str.replace(">", "&gt;")
    m_str = m_str.replace("\"", "&quot;")
    return m_str
I'm using the lxml library for this script and have run into the following issue:
One of the descriptions contains a \x03 character
(don't ask me why we have this character in a description, but we do).
For more visual people, here is a sample of the problematic description:
to_be_escaped
> 'gnebst G'
[(x,ord(x)) for x in to_be_escaped]
> <class 'list'>: [('g', 103), ('\x03', 3), ('n', 110), ('e', 101), ('b', 98), ('s', 115), ('t', 116), (' ', 32), ('G', 71)]
As you can see, the character between "g" and "n", which renders like a space, is actually an End of Text
character (ETX, dec. 3), while the second one is a normal space (dec. 32).
The problem is that lxml reacts pretty badly to it. Here is the code:
description = et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))
which fails with the following error (because of this character):
PCDATA invalid Char value 3, line 1
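The failure can be reproduced in isolation. The snippet below is only a rough sketch: it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (it also refuses this character, though its error message is worded differently), and `try_parse` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample value from above: "g", End of Text (\x03), then "nebst G".
to_be_escaped = "g\x03nebst G"


def try_parse(text):
    """Return None if the wrapped text parses, else the parser's error message."""
    try:
        ET.fromstring("<volltext>%s</volltext>" % text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# The control character makes the document ill-formed, so an error comes back.
print(try_parse(to_be_escaped))
# The same text with a real space instead parses without complaint.
print(try_parse("g nebst G"))
```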
My questions are:
- Of course, I could just extend my escape method to handle this character, but what guarantees that the same thing won't happen with another one?
- Where can I find a list of the characters that lxml considers "forbidden"?
- Has anyone else dealt with this kind of issue and found an appropriate escape method for it (the built-in one does no better than mine)?
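For context on the second question: the XML 1.0 specification defines the allowed characters (its "Char" production) as #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so anything outside those ranges is rejected by a conforming parser, not just by lxml. Below is a minimal sketch of a filter built on that range. It assumes that silently dropping the offending characters is acceptable for the feed; the names `strip_invalid_xml_chars` and `_INVALID_XML_CHARS` are hypothetical, and the standard library parser is used in place of lxml so the snippet runs on its own.

```python
import re
import xml.etree.ElementTree as ET

# Matches every character NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Drop characters that an XML 1.0 document cannot contain."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = strip_invalid_xml_chars("g\x03nebst G")
print(repr(cleaned))  # 'gnebst G'

# The cleaned text can now be wrapped and parsed without a character error.
element = ET.fromstring("<volltext>%s</volltext>" % cleaned)
print(element.text)
```

Such a filter would run alongside the escape method rather than replace it, since the characters it removes are not the ones that escape() rewrites. Whether to drop the characters or substitute a placeholder such as a space is a design choice that depends on the data.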