
In summary: minidom appears not to accept ISO 8859-1 named entities such as &aacute;. What's an appropriate resolution?

Here's code which illustrates my situation:

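import xml.dom.minidom

sample = """
  <html>
    <body>
      <h1>Un ejemplo</h1>
      <p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>
    </body>
  </html>
"""
# Same document, with the named entity swapped for a numeric character reference.
sample2 = sample.replace("&aacute;", "&#225;")

# Parses without complaint: numeric character references are plain XML.
dom2 = xml.dom.minidom.parseString(sample2)
# Raises ExpatError: the named entity &aacute; is undefined.
dom = xml.dom.minidom.parseString(sample)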

Briefly: when the HTML includes 'á' and similar characters, expressed as named entities, minidom complains:

... xml.parsers.expat.ExpatError: undefined entity ...

How should I respond? Do I

  • Replace named entities with corresponding literal constants?
  • Use a parser other than minidom? Which?
  • Somehow (with an encoding assignment?) convince minidom that these named entities are cool?

Convincing the author of the (X)HTML to eschew named entities is not feasible.

Cameron Laird
  • there are many, many previous answers to this question and its ilk, e.g. http://stackoverflow.com/questions/2676872/how-to-parse-malformed-html-in-python-using-standard-libraries – ekhumoro Oct 13 '11 at 16:37
  • Thank you, ekhumoro; I was so dull that I didn't recognize the customer really is in an HTML situation, and his labeling it XML was just noise I should have ignored. – Cameron Laird Oct 13 '11 at 16:49

1 Answer


xml.dom.minidom is an XML parser, not an HTML parser. Therefore, it doesn't know any HTML entities (only those which are common to both XML and HTML: &quot;, &amp;, &lt;, &gt; and &apos;).

Try BeautifulSoup.
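
If the input is otherwise well-formed XML and you would rather stay with minidom, one possible workaround is to rewrite the HTML named entities as numeric character references before parsing. The following is only a minimal sketch under that assumption: the helper name html_entities_to_numeric and the XML_PREDEFINED set are made up for illustration, and the module is html.entities on Python 3 (htmlentitydefs on Python 2). Note that the regular expression is naive and would also rewrite entity-like text inside comments or CDATA sections.

```python
import re
import xml.dom.minidom
from html.entities import name2codepoint

# The five entities XML predefines; minidom already understands these.
XML_PREDEFINED = {"quot", "amp", "lt", "gt", "apos"}


def html_entities_to_numeric(text):
    """Hypothetical helper: turn &aacute; into &#225; and so on."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


sample = "<p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>"
dom = xml.dom.minidom.parseString(html_entities_to_numeric(sample))

# Join the text children of <p> in case the parser split them.
text = "".join(node.data for node in dom.documentElement.childNodes)
```

This keeps the XML parser and its strictness. If the documents turn out to be real-world HTML rather than XML, an HTML parser such as BeautifulSoup is likely the better fit.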

Tim Pietzcker
  • Thanks, Tim Pietzcker: your reply is (extraordinarily!) quick, accurate, and close to what I need. As it happens, the data with which I'm working is *advertised* as XML; over the longer term, I'm going to need to research how to reconcile minidom's idea of the pertinent DTD with that of the data's author. In the meantime, though, I want you to know that your words are helpful: while I'm plenty familiar with BeautifulSoup, your analysis didn't occur to me, and I didn't even think to try it in this situation. You leave me considerably better than when I started. – Cameron Laird Oct 13 '11 at 16:42