I am using the lxml module to generate an XML file from Python.

We need to define some entity references that our external system will parse. Normally, the text of every element is escaped when the tree is serialized to an XML string:

from lxml import etree

root = etree.Element("root")
sub = etree.Element("sub")
sub.text = "&entity;text"
root.append(sub)
print(etree.tostring(root, encoding="unicode"))
# <root><sub>&amp;entity;text</sub></root>  (I want this without the escaping)

I found that the lxml.etree.Entity class is useful for this purpose:

root = etree.Element("root")
sub = etree.Element("sub")
entity = etree.Entity("entity")  # serialized as &entity;
entity.tail = "text"             # text that follows the reference
sub.append(entity)
root.append(sub)
print(etree.tostring(root, encoding="unicode"))
# <root><sub>&entity;text</sub></root>

However, if I try to set an attribute value that contains an entity reference, it fails:

root = etree.Element("root")
sub = etree.Element("sub")
entity = etree.Entity("entity")
entity.tail = "text"
sub.attrib["foo"] = entity

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-52-62cb8ef3a9a6> in <module>()
----> 1 sub.attrib["foo"] = entity

lxml.etree.pyx in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:58775)()

apihelpers.pxi in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:19025)()

apihelpers.pxi in lxml.etree._utf8 (src/lxml/lxml.etree.c:26460)()

TypeError: Argument must be bytes or unicode, got '_Entity'
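
The error arises because attribute values are plain strings rather than nodes, so there is nowhere to attach an Entity object. Passing the reference as a string is accepted, but the ampersand is then escaped on output. A minimal sketch of that behaviour follows; the import fallback is an assumption added only so the snippet also runs where lxml is not installed (the standard library's ElementTree escapes attribute values the same way):

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib ElementTree when lxml is missing
    import xml.etree.ElementTree as etree

root = etree.Element("root")
sub = etree.SubElement(root, "sub")

# Attribute values must be plain strings; there is no node to hang an Entity on.
sub.set("foo", "&entity;text")

serialized = etree.tostring(root, encoding="unicode")
print(serialized)  # the attribute is written as foo="&amp;entity;text"
```

The in-memory value stays "&entity;text"; only the serialized form carries the escaped ampersand.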

The output I want looks like this:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE foo [
  <!ENTITY ent "entity" >
  <!ENTITY aaa "aaaaaa" >
]>
<foo>
  <sub bar="&ent;bas">&aaa;bbb</sub>
</foo>

How can I generate this output with lxml?

furushchev
  • Did you look at http://stackoverflow.com/q/1328538/2823755 ? – wwii Nov 16 '16 at 19:02
  • @wwii Yes. So you mean there is no way to achieve this feature? – furushchev Nov 17 '16 at 15:10
  • From what I have been reading, this seems to be a *safety* feature. [This SO answer](http://stackoverflow.com/a/1091953/2823755) uses the phrase **must be escaped** regarding entities as attribute values. The [W3Schools tutorial shows unescaped attributes](http://www.w3schools.com/xml/xml_attributes.asp). You might have to resort to string replacement if the *external parser* cannot handle them. ```sub.attrib["foo"] = entity.text``` works, but ```tostring``` still escapes it: ```sub.attrib ---> {'foo': '&entity;', 'bar': '&entity;'}```. – wwii Nov 17 '16 at 17:01
  • From https://www.w3.org/TR/xml/#syntax: ```The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped ....``` – wwii Nov 17 '16 at 18:01
  • @wwii Thank you for investigating further. Hmm, it is complicated. For now, I solved it by replacing all escaped entity references with unescaped ones after serialization. – furushchev Nov 23 '16 at 16:19
  • I was hoping you would find, or someone would post a different/better solution. I am not XML literate. – wwii Nov 23 '16 at 16:26
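
The workaround described in the comments (unescaping entity references after serialization) might be sketched roughly as follows. This is an illustration under stated assumptions, not a built-in lxml feature: `ENTITIES`, `unescape_entity_refs` and `build_doctype` are hypothetical names, the entity table is copied from the desired output in the question, and the import fallback exists only so the snippet runs without lxml installed.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib ElementTree when lxml is missing
    import xml.etree.ElementTree as etree

# Hypothetical entity table, taken from the desired output in the question.
ENTITIES = {"ent": "entity", "aaa": "aaaaaa"}


def unescape_entity_refs(xml_text, names):
    """Turn '&amp;name;' back into '&name;' for the listed entity names only."""
    pattern = re.compile(r"&amp;(%s);" % "|".join(re.escape(n) for n in names))
    return pattern.sub(r"&\1;", xml_text)


def build_doctype(root_tag, entities):
    """Build a DOCTYPE with an internal subset declaring each entity."""
    lines = ['  <!ENTITY %s "%s" >' % (n, v) for n, v in entities.items()]
    return "<!DOCTYPE %s [\n%s\n]>" % (root_tag, "\n".join(lines))


root = etree.Element("foo")
sub = etree.SubElement(root, "sub")
sub.set("bar", "&ent;bas")   # escaped to &amp;ent;bas on serialization
sub.text = "&aaa;bbb"        # escaped to &amp;aaa;bbb on serialization

body = etree.tostring(root, encoding="unicode")
document = "\n".join([
    "<?xml version='1.0' encoding='utf-8'?>",
    build_doctype("foo", ENTITIES),
    unescape_entity_refs(body, ENTITIES),
])
print(document)
```

Restricting the replacement to declared names leaves other escaped ampersands (such as `&amp;amp;`) untouched. One caveat: text that was meant to contain a literal `&ent;` is unescaped as well, so this only suits documents where that cannot occur. When writing the result to a file, encode it as UTF-8 to match the declaration.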

0 Answers