I have an XML exporter that creates feeds from my database, and an escape method so that XML-sensitive characters in my data do not conflict with the XML markup.

The method looks like this:

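def escape(m_str):
    # Escape "&" first so the entities produced below are not escaped twice.
    m_str = m_str.replace("&", "&amp;")
    # Note: the "<br />" inserted here is itself escaped by the next two lines.
    m_str = m_str.replace("\n", "<br />")
    m_str = m_str.replace("<", "&lt;")
    m_str = m_str.replace(">", "&gt;")
    m_str = m_str.replace("\"", "&quot;")
    return m_str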

I'm using the lxml library for this script, and I have the following issue:

One of the descriptions contains a \x03 character (don't ask me why we have this character in a description, but we do).

For more visual people, here is a sample of the problematic description:

to_be_escaped
> 'gnebst G'
[(x,ord(x)) for x in to_be_escaped]
> <class 'list'>: [('g', 103), ('\x03', 3), ('n', 110), ('e', 101), ('b', 98), ('s', 115), ('t', 116), (' ', 32), ('G', 71)]

You can see that the first "space" is not really a space but an End of Text character (decimal 3), while the second is a normal space (decimal 32).

The problem is that lxml reacts pretty badly to it. Here is the code:

description = et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))

which fails (because of this character) with:

PCDATA invalid Char value 3, line 1

My questions are:

  • Of course, I could just extend my escape method to solve the problem, but what guarantees that it will not happen with another character?
  • Where can I find a list of the "forbidden" characters in lxml?
  • Has someone else dealt with this kind of issue and found an appropriate escape method for it (as the built-in one doesn't do better than mine)?
Laurent Meyer

1 Answer

I found the beginning of an answer there (all credit to the author for the very clear explanation).

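The underlying issue is that \x03 is a control character that XML 1.0 does not allow in character data, so the default parser rejects it. The workaround is to pass a parser that declares the source as UTF-8 and, more importantly, runs in recovery mode (recover=True), which tells lxml to keep parsing through malformed input instead of raising.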

We can do it by changing the following line:

et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description))

into

et.fromstring("<volltext>%s</volltext>" % cls.escape(job.description), parser=XMLParser(encoding='utf-8', recover=True))

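This makes the parsing much more tolerant of bad input. Note that recovery mode may silently drop or truncate content, so it is worth checking the resulting text.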
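As for the list of "forbidden" characters: it is not specific to lxml, it comes from the XML 1.0 specification, which only permits #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. An alternative to recovery mode is therefore to strip everything outside those ranges before escaping and parsing. The following is only a minimal sketch of that idea; the helper name strip_invalid_xml_chars is hypothetical, and it assumes that silently dropping such characters is acceptable for your data.

```python
import re

# Matches every character that is NOT allowed by the XML 1.0 "Char" production
# (allowed: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop characters an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


# The sample description from the question, with the \x03 character included.
cleaned = strip_invalid_xml_chars("g\x03nebst G")
print(repr(cleaned))  # 'gnebst G'
```

The cleaned string could then go through escape and et.fromstring as before, so the default (strict) parser can be kept and any other malformed input still raises an error instead of being silently recovered.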

Laurent Meyer