In summary: minidom appears not to accept ISO 8859-1 (Latin-1) named entities such as &aacute;. What's an appropriate resolution?
Here's code which illustrates my situation:
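import xml.dom.minidom

sample = """
<html>
<body>
<h1>Un ejemplo</h1>
<p>Me llamo Juan Fulano y Hern&aacute;ndez.</p>
</body>
</html>
"""
# Same document with the named entity swapped for the literal character.
sample2 = sample.replace("&aacute;", "á")

# Parses without complaint: no named entities remain.
dom2 = xml.dom.minidom.parseString(sample2)
# Raises ExpatError: the named entity &aacute; is still present.
dom = xml.dom.minidom.parseString(sample)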
Briefly: when the HTML includes '&aacute;' and similar characters expressed as named entities, minidom complains:
... xml.parsers.expat.ExpatError: undefined entity ...
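To pin down the failure, here is a minimal sketch that captures the exception instead of letting it propagate. It assumes Python 3; the helper name try_parse is my own invention for illustration, not part of minidom.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def try_parse(text):
    """Return the parser's error message, or None if the text parses."""
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError as err:
        return str(err)
    return None


# A named entity that XML does not predefine triggers the error.
message = try_parse("<p>Hern&aacute;ndez</p>")
print(message)

# XML's own predefined entities (such as &amp;) are accepted.
print(try_parse("<p>Fulano &amp; Hern\u00e1ndez</p>"))
```

The first call reports the undefined entity; the second returns None, which suggests the problem is limited to the HTML-specific entity names.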
How should I respond? Do I
- Replace named entities with corresponding literal constants?
- Use a parser other than minidom? Which?
- Somehow (with an encoding assignment?) convince minidom that these named entities are cool?
Convincing the author of the (X)HTML to avoid named entities is not feasible.
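For the first option, this is roughly what I have in mind: a sketch, assuming Python 3 and the standard html.entities.name2codepoint table. The names replace_named_entities and XML_PREDEFINED are hypothetical, chosen only for this example. XML's five predefined entities are deliberately left alone so that escaped markup characters stay escaped.

```python
import re
import xml.dom.minidom
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_named_entities(text):
    """Swap HTML named entities for literal characters before parsing."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or XML-native entities as-is
        return chr(name2codepoint[name])

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


fixed = replace_named_entities("<p>Hern&aacute;ndez &amp; Fulano</p>")
dom = xml.dom.minidom.parseString(fixed)

# Join all child text nodes rather than assuming there is exactly one.
text = "".join(node.data for node in dom.documentElement.childNodes)
print(text)
```

One caveat with this approach: a plain regular expression would also rewrite entity-like text inside CDATA sections or comments, which may or may not matter for the documents in question.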