
I'm using Python 3 to retrieve data from an API but I have problems parsing some XML documents from retrieved strings.

I have identified the specific string that causes this problem:

from xml.etree import ElementTree

bad_string = '<tag>Sample &#x91;cp 99-3a&#x92</tag>'
ElementTree.fromstring(bad_string)

This is the returned error which stops the script:

ParseError: not well-formed (invalid token): line 1, column 31

I tried some suggested fixes, such as the one below, but got the same error:

ElementTree.fromstring('<tag>Sample &#x91;cp 99-3a&#x92</tag>'.encode('ascii', 'ignore'))

How can I clean this string without writing a regular expression specific to this one case, so that other similar strings are handled too?

Edit: Now that @b_c and @mzjn have explained that my problem is unescaped characters, I have found one possible solution (Escape unescaped characters in XML with Python):

ElementTree.fromstring('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>', parser = etree.XMLParser(recover = True))
Wences
  • `&#x92` is the problem. If it had a semicolon at the end (`&#x92;`) it would be a correct numeric character reference. See https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_reference_overview. – mzjn Oct 23 '19 at 18:10

1 Answer


Your string contains HTML entities (whether it's XML or HTML) that need to be un-escaped. The &#x91; and &#x92 correspond to ‘ and ’, respectively.

If you use html.unescape, you'll see the cleaned up text:

>>> import html
>>> html.unescape('<tag>Sample &#x91;cp 99-3a&#x92</tag>')
'<tag>Sample ‘cp 99-3a’</tag>'
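As a minimal sketch, assuming the document contains no escaped markup characters such as &amp; or &lt; (un-escaping those would break the XML), the un-escaped string can then be handed to ElementTree:

```python
import html
import xml.etree.ElementTree as ET

bad_string = '<tag>Sample &#x91;cp 99-3a&#x92</tag>'

# html.unescape tolerates the missing semicolon and applies the HTML5
# mapping of 0x91/0x92 to the curly single quotes.
cleaned = html.unescape(bad_string)

# The cleaned string has no character references left, so it parses.
node = ET.fromstring(cleaned)
print(node.text)
```

This prints the text with the two curly quotes in place of the references.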

Edit: @mzjn pointed out that you can also fix the string by adding a missing semicolon to the 2nd Entity:

>>> import xml.etree.ElementTree as ET
>>> tag = ET.fromstring('<tag>Sample &#x91;cp 99-3a&#x92;</tag>')
>>> tag.text
'Sample \x91cp 99-3a\x92'

However, you'll see that the text still contains the \x91 and \x92 characters (and this fix requires that you can control the string's content). These are the MS CP1252 code points for the left and right single quotation marks. Using the html.unescape method above will still give you the cleaned-up text.
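If you do end up with \x91-style characters after parsing, one possible way to turn them into the intended punctuation is to reinterpret them as CP1252 bytes. This is only a sketch: `c1_to_cp1252` and `C1_TO_CP1252` are hypothetical names made up for illustration, and it assumes every character in the U+0080 to U+009F range in your data really is a stray CP1252 byte.

```python
import xml.etree.ElementTree as ET

# Build a translation table for the C1 range (U+0080-U+009F).
C1_TO_CP1252 = {}
for code in range(0x80, 0xA0):
    try:
        C1_TO_CP1252[code] = bytes([code]).decode('cp1252')
    except UnicodeDecodeError:
        pass  # byte is undefined in cp1252; leave that character unchanged

def c1_to_cp1252(text):
    """Hypothetical helper: map stray C1 characters to their cp1252 meaning."""
    return text.translate(C1_TO_CP1252)

# Same well-formed string as above (semicolon added to the second reference).
tag = ET.fromstring('<tag>Sample &#x91;cp 99-3a&#x92;</tag>')
fixed = c1_to_cp1252(tag.text)
print(fixed)
```

Here `fixed` should contain the curly quotes instead of \x91 and \x92, while all other characters are left alone.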

Comment Follow-up

In your comment, you added the wrinkle that your string also contains other, valid XML escape sequences (such as &amp;), which html.unescape will happily un-escape as well. Unfortunately, as you saw, that takes you back to square one: you now have an & that should be escaped but isn't (ElementTree would have un-escaped it for you).

>>> import html
>>> import xml.etree.ElementTree as ET
>>> cleaned = html.unescape('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>')
>>> print(cleaned)
<tag>&Sample ‘cp 99-3a’</tag>
>>> ET.fromstring(cleaned)
Traceback (most recent call last):
  ...
ParseError: not well-formed (invalid token): line 1, column 12
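One way around this, sketched here under some assumptions, is to un-escape only the problematic references and leave valid XML escapes such as &amp; for ElementTree. `fix_c1_references` and `C1_REF` are hypothetical names, and the pattern only covers hexadecimal references in the 0x80 to 0x9F range (decimal forms like &#145; are not handled):

```python
import html
import re
import xml.etree.ElementTree as ET

# Hex references in the C1 range (0x80-0x9F), with or without a semicolon.
# The lookahead avoids cutting a longer hex reference in half.
C1_REF = re.compile(r'&#x([89][0-9a-f])(?![0-9a-f]);?', re.IGNORECASE)

def fix_c1_references(xml_text):
    """Hypothetical helper: un-escape only C1-range references."""
    def replace(match):
        # Let html.unescape apply the HTML5 cp1252 mapping for this one reference.
        return html.unescape('&#x{};'.format(match.group(1)))
    return C1_REF.sub(replace, xml_text)

cleaned = fix_c1_references('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>')
node = ET.fromstring(cleaned)  # &amp; is still escaped, so this parses
print(node.text)
```

Because &amp; is left untouched in `cleaned`, ElementTree un-escapes it itself and `node.text` starts with a plain &.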

Another option is to try soupparser from lxml.html (it needs BeautifulSoup installed), which is much better at handling problematic HTML/XML:

>>> from lxml.html import soupparser
>>> soupparser.fromstring('<tag>&amp;Sample &#x91;cp 99-3 a&#x92;</tag>').text_content()
'&Sample ‘cp 99-3 a’'

Or, depending on your needs, you might be better off doing a string/regex replace before parsing to get rid of the annoying cp1252 characters:

>>> import re
# Matches "&#x91" or "&#x92", with or without trailing semicolon
>>> node = ET.fromstring(re.sub(r'&#x9[1-2];?', "'", '<tag>&amp;Sample &#x91;cp 99-3 a&#x92;</tag>'))
>>> node.text
"&Sample 'cp 99-3 a'"
b_c
  • The semicolon is irrelevant, at least for `html.unescape` apparently. And yes, they _are_ [HTML Entities](https://www.w3schools.com/html/html_entities.asp) – b_c Oct 23 '19 at 18:08
  • It is misleading to say "it's HTML with HTML-Entities". `bad_string` in the question would be well-formed XML if there was a semicolon after `&#x92`. – mzjn Oct 23 '19 at 18:09
  • Thanks a lot @b_c and @mzjn, both solutions are valid, but now I have another problem with `&amp;`. When I run, for instance, `ElementTree.fromstring(html.unescape('&amp;Sample &#x91;cp 99-3a&#x92'))` I have the same problem as before. – Wences Oct 23 '19 at 19:34
  • Updated with some additional options based on that :) – b_c Oct 23 '19 at 20:17
  • Awesome! This is the solution that I was looking for, thanks! – Wences Oct 23 '19 at 21:57