Your string contains HTML entities (numeric character references, which are valid in both XML and HTML) that need to be unescaped. The &#x91; and &#x92; references correspond to ‘ and ’, respectively.
If you use html.unescape, you'll see the cleaned-up text:
>>> import html
>>> html.unescape('<tag>Sample &#x91;cp 99-3a&#x92</tag>')
'<tag>Sample ‘cp 99-3a’</tag>'
Edit: @mzjn pointed out that you can also fix the string by adding the missing semicolon to the second entity:
>>> import xml.etree.ElementTree as ET
>>> tag = ET.fromstring('<tag>Sample &#x91;cp 99-3a&#x92;</tag>')
>>> tag.text
'Sample \x91cp 99-3a\x92'
However, the result still contains the \x91 and \x92 characters, and this approach requires that you control the string's content. ElementTree decodes the references literally to U+0091 and U+0092, which are C1 control characters; 0x91 and 0x92 are the MS CP1252 byte values for the left and right single quotation marks. Using the html.unescape approach above will still give you the cleaned-up text.
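If you do end up with \x91 and \x92 in already-parsed text, one possible cleanup is to re-interpret the C1 control range as CP1252 bytes. This is only a sketch: fix_c1_controls is a hypothetical helper name, and it assumes that any character in U+0080..U+009F was really meant as a CP1252 character.

```python
import xml.etree.ElementTree as ET


def fix_c1_controls(text):
    """Map code points U+0080..U+009F to their CP1252 characters (hypothetical helper)."""
    out = []
    for ch in text:
        if '\x80' <= ch <= '\x9f':
            try:
                # The code point equals the CP1252 byte value, so round-trip it
                ch = ch.encode('latin-1').decode('cp1252')
            except UnicodeDecodeError:
                pass  # byte is undefined in CP1252 (e.g. 0x81); keep as-is
        out.append(ch)
    return ''.join(out)


tag = ET.fromstring('<tag>Sample &#x91;cp 99-3a&#x92;</tag>')
fixed = fix_c1_controls(tag.text)
print(fixed)
```

Characters outside that range are left untouched, so the helper should be safe to run on text that also contains ordinary non-ASCII characters.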
Comment Follow-up
In your comment, you added the wrinkle that your string contains other valid XML escape sequences (such as &amp;), which html.unescape will happily unescape as well. Unfortunately, as you saw, that takes you back to square one: you now have a bare & that should be escaped but isn't (ElementTree would have unescaped it for you).
>>> import html
>>> import xml.etree.ElementTree as ET
>>> cleaned = html.unescape('<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>')
>>> print(cleaned)
<tag>&Sample ‘cp 99-3a’</tag>
>>> ET.fromstring(cleaned)
Traceback (most recent call last):
...
ParseError: not well-formed (invalid token): line 1, column 12
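One way around this might be to unescape only the numeric character references and leave named escapes such as &amp; for the XML parser. The sketch below assumes that is acceptable for your data; NUMERIC_REF and unescape_numeric_refs are hypothetical names, and a numeric reference to & or < (for example &#38;) would still break the XML, so treat it as a starting point rather than a complete fix.

```python
import html
import re
import xml.etree.ElementTree as ET

# Numeric character references only (hex or decimal), semicolon optional
NUMERIC_REF = re.compile(r'&#(?:[xX][0-9a-fA-F]+|[0-9]+);?')


def unescape_numeric_refs(text):
    """Unescape numeric references; leave named escapes like &amp; alone."""
    return NUMERIC_REF.sub(lambda m: html.unescape(m.group(0)), text)


raw = '<tag>&amp;Sample &#x91;cp 99-3a&#x92</tag>'
cleaned = unescape_numeric_refs(raw)
print(cleaned)

# &amp; is still escaped, so ElementTree can parse and unescape it itself
node = ET.fromstring(cleaned)
print(node.text)
```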
Another option is soupparser from lxml.html (it requires BeautifulSoup to be installed), which is much better at handling problematic HTML/XML:
>>> from lxml.html import soupparser
>>> soupparser.fromstring('<tag>&amp;Sample &#x91;cp 99-3 a&#x92</tag>').text_content()
'&Sample ‘cp 99-3 a’'
Or, depending on your needs, you might be better off doing a string/regex replacement before parsing to remove the troublesome CP1252 references:
>>> import re
>>> # Matches "&#x91;" or "&#x92;", with or without trailing semicolon
>>> node = ET.fromstring(re.sub(r'&#x9[1-2];?', "'", '<tag>&amp;Sample &#x91;cp 99-3 a&#x92</tag>'))
>>> node.text
"&Sample 'cp 99-3 a'"