How can I resolve an external unparsed entity when parsing with lxml?

Here is my code example:

import io

from lxml import etree

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))

Note: I'm using lxml >= 3.4

Currently I have the following result:

<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg" >
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>

Here, the ref1 entity isn't resolved to "python-logo-small.jpg". I expected to get <sample src="python-logo-small.jpg"/>. Is there something wrong?

I also tried:

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True, load_dtd=True)

But I get the same result.

Alternatively, I'd like to resolve the entities myself. To do that, I tried to list the entities this way:

for entity in doc.docinfo.internalDTD.iterentities():
    msg_fmt = "{entity.name!r}, {entity.content!r}, {entity.orig!r}"
    print(msg_fmt.format(entity=entity))

But this only gives me the entity's name and the notation's name, not the entity's definition:

'ref1', 'jpeg', None

How can I access the entity's definition?

Remi Guan
Laurent LAPORTE

2 Answers

The XML document with the unparsed entity looks OK. But unparsed entities do not get resolved in the way you seem to expect. If you want to see <sample src="python-logo-small.jpg"/> in the parsed output, use an internal (parsed) entity.

Example:

import io
from lxml import etree

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="&ref1;"/>
"""

parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))

Output:

<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="python-logo-small.jpg"/>

Notes:

  • The ref1 entity is declared as an internal entity.
  • The entity is referenced with &ref1;.
  • The src attribute is declared as type CDATA.
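
As a quick cross-check of these notes, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml (a swapped-in parser, chosen only so the snippet has no dependencies). It assumes that ElementTree's expat-based parser expands internal entities referenced in attribute values, while an ENTITY-typed attribute that holds a bare entity name is left as it is:

```python
import xml.etree.ElementTree as ET

# Internal (parsed) entity, referenced with &ref1; in a CDATA attribute.
internal = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!ENTITY ref1 "python-logo-small.jpg">
<!ELEMENT sample EMPTY>
<!ATTLIST sample src CDATA #REQUIRED>
]>
<sample src="&ref1;"/>
"""

# Unparsed entity from the question: the attribute value is only a name,
# not an entity reference, so there is nothing for the parser to expand.
unparsed = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""

internal_src = ET.fromstring(internal).get("src")
unparsed_src = ET.fromstring(unparsed).get("src")
print(internal_src)
print(unparsed_src)
```

The first value should be the replacement text of the internal entity, the second the untouched name ref1, which matches what the question observed with lxml.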

You can get the value (URI) of unparsed entities with XSLT, using the unparsed-entity-uri function. To see it in action, add the following lines to the code example in the question:

xsl = etree.XML('''\
<xsl:stylesheet version="1.0" 
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output encoding="utf-8" omit-xml-declaration="yes"/>
 <xsl:template match="sample">
   <xsl:value-of select="unparsed-entity-uri(@src)"/>
 </xsl:template>
</xsl:stylesheet>
''')

transform = etree.XSLT(xsl)
result = transform(doc)
print(result)

Output:

python-logo-small.jpg
mzjn
  • Unfortunately, I'm not the author of the XML content, so I can't modify it. Is there a way to list the internal entities? Any example with xml.sax.handler? – Laurent LAPORTE Sep 24 '15 at 12:20

OK, the parser won't "resolve" external unparsed entities for us, but we can list them with a SAX DTDHandler:

import io

import xml.sax
import xml.sax.handler

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""


class MyDTDHandler(xml.sax.handler.DTDHandler):
    """Print every unparsed (NDATA) entity declared in the DTD."""

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        # systemId carries the entity's definition, here "python-logo-small.jpg".
        print(dict(name=name, publicId=publicId, systemId=systemId, ndata=ndata))


parser = xml.sax.make_parser()
parser.setDTDHandler(MyDTDHandler())
parser.parse(io.BytesIO(content))

The result (printed here by Python 2, hence the u prefixes) is:

{'systemId': u'python-logo-small.jpg', 'ndata': u'jpeg', 'publicId': None, 'name': u'ref1'}

The systemId value is the entity's definition, so the src attribute can now be resolved by hand.
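
To go one step further, here is a minimal sketch of resolving the attribute by hand with the same SAX approach. The class names EntityCollector and SrcResolver are hypothetical, and the attribute name src is hard-coded as an assumption; a more general version would read the ATTLIST declarations to find which attributes have type ENTITY. It relies on the DTD being reported before the first element, so a single pass is enough:

```python
import io
import xml.sax
import xml.sax.handler

CONTENT = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""


class EntityCollector(xml.sax.handler.DTDHandler):
    """Record each unparsed entity's name and system identifier."""

    def __init__(self):
        self.entities = {}

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.entities[name] = systemId


class SrcResolver(xml.sax.handler.ContentHandler):
    """Map the src attribute (assumed to be ENTITY-typed) to its system id."""

    def __init__(self, entities):
        super().__init__()
        self.entities = entities
        self.resolved = []

    def startElement(self, name, attrs):
        src = attrs.get("src")
        if src is not None:
            # Fall back to the raw value if no such entity was declared.
            self.resolved.append((name, self.entities.get(src, src)))


collector = EntityCollector()
resolver = SrcResolver(collector.entities)

parser = xml.sax.make_parser()
parser.setDTDHandler(collector)
parser.setContentHandler(resolver)
parser.parse(io.BytesIO(CONTENT))

print(collector.entities)
print(resolver.resolved)
```

The two handlers share one dictionary, so the resolver sees the declarations that the collector recorded while the internal subset was being read.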

Laurent LAPORTE