1

I have a script that takes XML as a string and attempts to parse it using xml

Here is an example of the code I am working with

from xml.etree.ElementTree import fromstring
my_xml = """
    <documents>
          <record>Hello< &O >World</record>
    </documents>
"""
xml = fromstring(my_xml)

When I run the code, I get a ParseError

Traceback (most recent call last):
  File "C:/Code/Python/xml_convert.py", line 7, in <module>
    xml = fromstring(my_xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 3, column 18

As stated in Invalid Characters in XML, it is due to having the HTML entities <, >, and &

How may I go about handling these entities so the XML reads them as plain text?

Community
  • 1
  • 1
Wondercricket
  • 7,651
  • 2
  • 39
  • 58
  • Are you sure that's valid XML? You should probably escape those by using `<` `>` and `&` – kichik Jan 29 '16 at 21:51
  • @kichik Hence the reason for my question. The XML is typically escaped, but I encountered a situation where it was not. – Wondercricket Jan 29 '16 at 21:57
  • 2
    It really depends on the context, but I would reject the XML or fix it manually. If it's coming from an automated system, I would consider fixing that system. – kichik Jan 29 '16 at 21:58
  • Obligatory [Perhaps you should use an HTML parser?](http://stackoverflow.com/a/1732454/4532996) – cat Jan 29 '16 at 23:59

2 Answers2

4

You can use the lxml Parser with the recover=True flag:

In [25]: import lxml.etree as ET

In [26]: from lxml.etree import XMLParser

In [27]: my_xml = """
   ....:     <documents>
   ....:           <record>Hello< &O >World</record>
   ....:     </documents>
   ....: """

In [28]: parser = XMLParser(recover=True)

In [29]: element = ET.fromstring(my_xml, parser=parser)

In [30]: for text in element.itertext():
   ....:     print(text)
   ....:     


Hello  >World
Charles Duffy
  • 280,126
  • 43
  • 390
  • 441
styvane
  • 59,869
  • 19
  • 150
  • 156
  • 1
    Hmm. If the OP intends for the invalid content to be read as literals rather than discarded, this might be suboptimal, but then, there may not *be* an optimal choice. – Charles Duffy Jan 30 '16 at 00:05
0

You can't do what you're asking. Your document is not well-formed XML, and any conformant XML parser will reject it.

You could write code that used regular expressions to fix it up and make it XML, but any such solution would almost certainly be buggy and error-prone and cause more problems than it solves.

If you really, really can't fix this at the source so the documents are well-formed, then your best option may be to fix them up by hand with human intelligence.