
I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:

<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned, and everything I try gives errors:

    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint     import pprint
    from xml.etree  import ElementTree
    from cStringIO  import StringIO

    parser = ElementTree.ElementTree()
   #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
   #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Which depending on how you adjust the comments gives:

xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5

or

AttributeError: 'ElementTree' object has no attribute 'entity'

or

AttributeError: 'str' object has no attribute 'feed'

For those curious, the XML is from OpenStreetMap's Mapnik project.

Bryce
  • Possibly related question: http://stackoverflow.com/questions/2524299/entity-references-and-lxml – unutbu Aug 30 '11 at 01:11
  • Not related, because in that case the entity is actually defined. Remove the entity definition and you're back to my question. – Bryce Aug 30 '11 at 06:21
  • fyi - someone may want to fix the /usr/bin/python to /usr/bin/env python as the shebang line is wrong for most systems. – Good Person Nov 12 '12 at 02:41

2 Answers


As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.

I finally got it working. Quoted from this Q&A.

Inspired by this post, we can just prepend a DOCTYPE declaration that defines the entities to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
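
The same idea can be tried on the question's plain XML, since expat reads entity declarations from a DOCTYPE internal subset whether or not the document is HTML. The following is only a minimal sketch under some assumptions: `parse_with_declared_entities` is a hypothetical helper name, the input has no leading `<?xml ...?>` declaration (prepending a DOCTYPE before one would break the parse), and the replacement values are plain text without `"`, `%`, or `&`. Each entity is mapped to a marker string so it can be found and acted on after parsing.

```python
import xml.etree.ElementTree as ET


def parse_with_declared_entities(xml_text, root_tag, entities):
    """Prepend a DOCTYPE that declares each entity, then parse.

    Hypothetical helper: `entities` maps entity names to plain-text
    replacement values (assumed to contain no '"', '%' or '&').
    """
    declarations = "".join(
        '<!ENTITY {} "{}">'.format(name, value)
        for name, value in entities.items()
    )
    doctype = "<!DOCTYPE {} [{}]>".format(root_tag, declarations)
    return ET.fromstring(doctype + xml_text)


xml_text = """<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>"""

# Map the entity to a marker so it can be recognised after parsing.
root = parse_with_declared_entities(
    xml_text, "Style", {"maxscale_zoom11": "ENTITY:maxscale_zoom11"}
)

# The expanded entity follows </Filter>, so it lands in that element's tail.
marker = root.find("Rule/Filter").tail.strip()
print(marker)
```

The obvious limitation is that every entity name has to be known before parsing, because each one must be declared in the prepended DOCTYPE.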
RayLuo
  • This only applies to HTML documents though, right? The notion of 'DOCTYPE' processing instructions does not apply to "simple XML files" as the OP is apparently dealing with. – Frerich Raabe Oct 23 '19 at 13:27
  • @FrerichRaabe, Sorry, admittedly I did not test it on XML docs. That answer was quoted from [here](https://stackoverflow.com/questions/35591478/how-to-parse-html-with-entities-such-as-nbsp-using-builtin-library-elementtree), and was hoping it would be helpful. That original Q&A link contains another answer that may or may not help in your situation. – RayLuo Oct 23 '19 at 20:07

I'm not sure if this is a bug in ElementTree or what, but you need to call UseForeignDTD(True) on the expat parser to make it behave the way it did in the past.

It's a bit hacky, but you can do this by creating your own instance of ElementTree.XMLParser, calling the method on its underlying expat parser (the parser attribute), and then passing the XMLParser to ElementTree.parse():

from xml.etree  import ElementTree
from cStringIO  import StringIO


testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "MOOOOO".

Or using a mapping interface:

from xml.etree  import ElementTree
from cStringIO  import StringIO

class AllEntities:
    def __getitem__(self, key):
        #key is your entity, you can do whatever you want with it here
        return key

testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "moo_1".

A more complex fix would be to subclass ElementTree.XMLParser and fix it there.
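
For Python 3, where the C-accelerated XMLParser does not expose the underlying expat parser, one possible workaround is to drive `xml.parsers.expat` directly and feed an `ElementTree.TreeBuilder`. This is only a sketch and not part of the code above: `parse_with_entity_hook` and `resolve` are hypothetical names, namespaces are not handled, and comments and processing instructions are silently dropped. It relies on the same `UseForeignDTD(True)` behaviour, which makes expat pass undeclared entity references to the default handler instead of raising an error.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_with_entity_hook(xml_text, resolve):
    """Parse xml_text, calling resolve(name) for each undeclared entity.

    Hypothetical helper: resolve receives the entity name without the
    '&' and ';' and returns the text to insert in its place.
    """
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()
    # Without this, expat raises "undefined entity" when no DTD is present.
    parser.UseForeignDTD(True)
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data

    def on_default(data):
        # Undeclared entity references arrive here as the raw "&name;".
        if data.startswith("&") and data.endswith(";"):
            builder.data(resolve(data[1:-1]))

    parser.DefaultHandlerExpand = on_default
    parser.Parse(xml_text, True)
    return builder.close()


seen = []


def record(name):
    # Act on the entity here; this example records it and echoes the name.
    seen.append(name)
    return name


root = parse_with_entity_hook("<foo>&moo_1;</foo>", record)
print(root.text)
```

Because `resolve` is called for any name, nothing has to be predefined, which matches the mapping-interface variant above.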

cnelson
  • A bit icky, as you say, but thanks. Is there any way to avoid having to predefine the entities (e.g. &moo_2)? – Bryce Sep 01 '11 at 06:25
  • @Bryce: being predefined is the point of entities, no? Nevertheless: you could set `parser.entity` to your own dictionary-like object. As a simple example, you could do `parser.entity = collections.defaultdict(str)` to have all undefined entities replaced by an empty string. – Steven Sep 01 '11 at 14:12
  • To follow up @Steven's comment, you could also implement a mapping interface and do whatever you want with the keys. I edited my answer to show a simple example of that. – cnelson Sep 01 '11 at 14:52
  • The code works with Python 2.7 (in earlier versions `parse()` does not accept a `parser` keyword argument). – mzjn Sep 01 '11 at 16:10
  • Yes, 2.7+; I was answering the OP's question that was hidden in the comments of his sample code: ## Where's the entity support as documented at http://effbot.org/elementtree/elementtree-xmlparser.htm In Python 2.7.1+?. – cnelson Sep 01 '11 at 16:56
  • This won't work in Python 3 with CPython, where the C versions (formerly `cElementTree`) are being used instead. – phihag Aug 02 '13 at 14:02
  • I'm not sure if this is possible at all in Python 3 currently. Looking at the [docs](http://docs.python.org/3/library/xml.etree.elementtree.html#xmlparser-objects) I see the following method signature **xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)** but the docs say _Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation._ It looks like ElementTree is getting more strict, and if your entities aren't defined, then the document is not valid and won't be parsed. – cnelson Aug 11 '13 at 15:44
  • I had this working in 2.7 in an overridden ElementTree XMLParser, but I can no longer extend that in 3.5 as you point out, because inheriting from the cElementTree parser is not possible. Not sure if I should be pushing my content through a custom codec before parsing, or what. Is there a standard answer for Python 3? – Epu Dec 23 '15 at 21:13