I need to unescape some HTML Entities from a string.
However, for a few characters like “’“”'
I would like to substitute the nearest ASCII equivalent. Everything else should be stripped.
How can I do that in Python? I tried the following snippet, but it doesn't do "nearest" the way I want.
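# Python 2 (the module is html.parser in Python 3)
import HTMLParser
import unicodedata
parser = HTMLParser.HTMLParser()
parsed = parser.unescape("&lsquo;")  # u'\u2018'
# U+2018 has no NFKD decomposition, so 'ignore' drops it entirely
nearest = unicodedata.normalize('NFKD', parsed).encode('ascii','ignore')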
nearest
is empty in the above code. Can I supply an argument to HTMLParser.unescape to convert it to ASCII quotes? I want to supply a custom mapping like this: {'&lsquo':'"','&rsquo':'"'}, where the items in the map are converted to ASCII.
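To make the goal concrete, here is a minimal sketch of the behaviour I'm after, written for Python 3, where html.unescape is the counterpart of HTMLParser.unescape. As far as I can tell it takes no mapping argument, so the mapping is applied after unescaping. The names ASCII_NEAREST and unescape_to_ascii are made up for illustration, and the choice of replacements is only an assumption about what "nearest" should mean.

```python
import html

# Hypothetical mapping from typographic characters to ASCII stand-ins.
ASCII_NEAREST = {
    "\u2018": "'",  # left single quotation mark (&lsquo;)
    "\u2019": "'",  # right single quotation mark (&rsquo;)
    "\u201c": '"',  # left double quotation mark (&ldquo;)
    "\u201d": '"',  # right double quotation mark (&rdquo;)
}


def unescape_to_ascii(text, mapping=ASCII_NEAREST):
    """Unescape entities, map known characters to ASCII, drop the rest."""
    unescaped = html.unescape(text)
    # Replace each mapped character with its ASCII stand-in.
    translated = unescaped.translate(str.maketrans(mapping))
    # Anything still outside ASCII is stripped.
    return translated.encode("ascii", "ignore").decode("ascii")


result = unescape_to_ascii("&lsquo;quoted&rsquo; &ldquo;text&rdquo; caf&eacute;")
```

Here the curly quotes become straight quotes, while the é from &eacute; is dropped because it is not in the mapping.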
xml.sax.parse
has an API like unescape(html_text, entities={'&nbsp;': ' ', '&quot;': '"'}).
Does HTMLParser have something similar?
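For reference, the function I have in mind is probably xml.sax.saxutils.unescape, which accepts an entities dictionary. A small sketch of how it behaves with a custom mapping follows; the mapping and the sample string are only illustrative. Note that it leaves unknown entities untouched rather than stripping them, so on its own it is not quite what I need.

```python
from xml.sax.saxutils import unescape

# Illustrative mapping: full entity strings (with the trailing ';') to ASCII.
custom = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
}

# &lt;, &gt; and &amp; are handled by default; everything else must be
# listed in the mapping, otherwise it passes through unchanged.
result = unescape("&lsquo;hi&rsquo; &eacute;", custom)
```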