How do I parse XML that contains HTML entities?

Question

I have a script that takes XML as a string and attempts to parse it using xml.etree.ElementTree.

Here is an example of the code I am working with:

from xml.etree.ElementTree import fromstring
my_xml = """
    <documents>
          <record>Hello< &O >World</record>
    </documents>
"""
xml = fromstring(my_xml)

When I run the code, I get a ParseError:

Traceback (most recent call last):
  File "C:/Code/Python/xml_convert.py", line 7, in <module>
    xml = fromstring(my_xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 18

As stated in Invalid Characters in XML, this is because the record text contains the unescaped characters <, >, and &, which would normally be written as entities.

How may I go about handling these entities so the XML reads them as plain text?

Are you sure that's valid XML? You should probably escape those by using `&lt;` `&gt;` and `&amp;` — kichik, Jan 29 '16 at 21:51
@kichik Hence the reason for my question. The XML is typically escaped, but I encountered a situation where it was not. — Wondercricket, Jan 29 '16 at 21:57
It really depends on the context, but I would reject the XML or fix it manually. If it's coming from an automated system, I would consider fixing that system. — kichik, Jan 29 '16 at 21:58
Obligatory [Perhaps you should use an HTML parser?](http://stackoverflow.com/a/1732454/4532996) — cat, Jan 29 '16 at 23:59
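The two suggestions in the comments above (escape the text where the XML is produced, or reject input that is not well-formed) can be sketched with the standard library alone. This is a minimal illustration, not the asker's actual pipeline: the `raw_text` value is taken from the question, while the single-line document layout and the variable names are assumptions made for the example.

```python
from xml.etree.ElementTree import ParseError, fromstring
from xml.sax.saxutils import escape

raw_text = "Hello< &O >World"  # the record text from the question

# Producer side: escape the text before embedding it in markup.
escaped = escape(raw_text)
good_xml = "<documents><record>{}</record></documents>".format(escaped)
record = fromstring(good_xml).find("record")
print(record.text)  # the parser hands back the original, unescaped text

# Consumer side: reject input that is not well-formed instead of guessing.
bad_xml = "<documents><record>{}</record></documents>".format(raw_text)
try:
    fromstring(bad_xml)
    rejected = False
except ParseError as err:
    rejected = True
    print("rejected:", err)
```

Escaping at the source keeps the original text intact after parsing, which is what the question asks for. The consumer-side check only reports the problem; it does not repair anything.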

score 4 · Accepted Answer · edited Jan 30 '16 at 00:02

You can use the lxml Parser with the recover=True flag:

In [25]: import lxml.etree as ET

In [26]: from lxml.etree import XMLParser

In [27]: my_xml = """
   ....:     <documents>
   ....:           <record>Hello< &O >World</record>
   ....:     </documents>
   ....: """

In [28]: parser = XMLParser(recover=True)

In [29]: element = ET.fromstring(my_xml, parser=parser)

In [30]: for text in element.itertext():
   ....:     print(text)
   ....:     


Hello  >World

edited Jan 30 '16 at 00:02 by Charles Duffy
answered Jan 29 '16 at 22:59 by styvane

Hmm. If the OP intends for the invalid content to be read as literals rather than discarded, this might be suboptimal, but then, there may not *be* an optimal choice. – Charles Duffy Jan 30 '16 at 00:05

score 0 · Answer 2 · answered Jan 29 '16 at 23:56

You can't do what you're asking. Your document is not well-formed XML, and any conformant XML parser will reject it.

You could write code that used regular expressions to fix it up and make it XML, but any such solution would almost certainly be buggy and error-prone and cause more problems than it solves.

If you really, really can't fix this at the source so the documents are well-formed, then your best option may be to fix them up by hand with human intelligence.
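To make the warning about regex fix-ups concrete, here is a rough sketch of what such a pre-processing step might look like and where it breaks. `naive_fixup` and both patterns are hypothetical, written only for this illustration. They assume that a `<` followed by a letter always starts a tag, and that assumption is exactly what fails on the second input.

```python
import re
from xml.etree.ElementTree import ParseError, fromstring

# Hypothetical heuristics: an '&' that does not start an entity reference,
# and a '<' that does not look like the start of a tag.
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
_BARE_LT = re.compile(r"<(?![A-Za-z_/!?])")

def naive_fixup(text):
    """Escape characters that are probably stray; this is a guess, not a parse."""
    text = _BARE_AMP.sub("&amp;", text)
    return _BARE_LT.sub("&lt;", text)

# Works on the sample from the question, because '<' is followed by a space.
sample = "<documents><record>Hello< &O >World</record></documents>"
fixed_text = fromstring(naive_fixup(sample)).find("record").text
print(fixed_text)

# Fails on text where a stray '<' happens to look like a tag.
ambiguous = "<documents><record>a<b and c>d</record></documents>"
try:
    fromstring(naive_fixup(ambiguous))
    still_broken = False
except ParseError:
    still_broken = True
print("ambiguous input still rejected:", still_broken)
```

The heuristic recovers the first record but leaves the second one unparseable, and it has no way of knowing whether `<b and c>` was meant as markup or as text. That ambiguity is why fixing the producer, or repairing documents by hand, is the safer route.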

answered Jan 29 '16 at 23:56 by Elliotte Rusty Harold

[**NO.** Do not ever, **ever** under *any* circumstances use Regex to parse consequentially XHTML.](http://stackoverflow.com/a/1732454/4532996) – cat Jan 30 '16 at 00:03
@cat, I read the answerer here as already endorsing that viewpoint. – Charles Duffy Jan 30 '16 at 00:04
@CharlesDuffy not strongly enough imo – cat Jan 30 '16 at 00:05
