HTML Unescape by converting custom elements to ASCII?

Question

I need to unescape some HTML Entities from a string.

However, for a few characters such as ‘ ’ “ ” I would like to substitute the nearest ASCII equivalent. All other non-ASCII characters need to be stripped.

How can I do that in Python? I tried the following snippet, but it doesn't pick the "nearest" character the way I want.

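# Python 2: HTMLParser.unescape() decodes entities into a unicode string
import HTMLParser
import unicodedata

parser = HTMLParser.HTMLParser()
parsed = parser.unescape("&lsquo;")  # u'\u2018'
# Encode to ASCII after NFKD normalization, ignoring anything that does not fit
nearest = unicodedata.normalize('NFKD', parsed).encode('ascii', 'ignore')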

nearest is empty in the above code. Can I supply an argument to HTMLParser.unescape to convert it to ASCII quotes? I want to supply a custom mapping like this: {'&lsquo;': "'", '&rsquo;': "'"}, where the entities in the map are converted to the given ASCII characters.
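For illustration, here is a minimal sketch of the behaviour being asked for. It is written for Python 3, where `html.unescape` replaces `HTMLParser.unescape`. The result above is empty because U+2018 has no Unicode decomposition, so NFKD leaves it unchanged and the ASCII encode step drops it. The sketch decodes entities first, then applies an explicit character table, then strips whatever is still non-ASCII. The names `ASCII_FALLBACKS` and `unescape_to_ascii` are hypothetical, and the table covers only the four curly quotes.

```python
import html

# Hypothetical fallback table: Unicode character -> nearest ASCII character.
# Extend it with any other characters that should survive as ASCII.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'",   # left single quotation mark  (&lsquo;)
    "\u2019": "'",   # right single quotation mark (&rsquo;)
    "\u201c": '"',   # left double quotation mark  (&ldquo;)
    "\u201d": '"',   # right double quotation mark (&rdquo;)
})


def unescape_to_ascii(text):
    """Decode HTML entities, map selected characters to ASCII, drop the rest."""
    unescaped = html.unescape(text)               # entities -> Unicode characters
    mapped = unescaped.translate(ASCII_FALLBACKS)  # apply the custom mapping
    # Anything still outside ASCII is silently removed.
    return mapped.encode("ascii", "ignore").decode("ascii")


print(unescape_to_ascii("&lsquo;hi&rsquo; &ldquo;x&rdquo; caf&eacute; &amp;"))
```

Mapping characters rather than entity names means numeric references such as `&#8216;` and literal curly quotes in the input are handled by the same table.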

xml.sax.saxutils has an API, unescape(html_text, entities={'&nbsp;': ' ', "&quot;": '"'}). Does HTMLParser have something similar?
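As a point of comparison, this is roughly how the `entities` argument of `xml.sax.saxutils.unescape` behaves. The sketch assumes Python 3; the `extra` mapping and the sample strings are made up for illustration. Only `&lt;`, `&gt;`, `&amp;` and the entities supplied in the dictionary are replaced. Everything else is left as-is, so on its own it does not strip unknown entities.

```python
from xml.sax.saxutils import unescape

# Hypothetical custom mapping: full entity text -> ASCII replacement.
extra = {
    "&lsquo;": "'",
    "&rsquo;": "'",
    "&ldquo;": '"',
    "&rdquo;": '"',
}

# Entities listed in `extra` are replaced, along with &lt;, &gt; and &amp;.
result = unescape("&lsquo;quoted&rsquo; &amp; &ldquo;done&rdquo;", extra)
print(result)

# Entities that are not in the mapping pass through unchanged.
untouched = unescape("caf&eacute;", extra)
print(untouched)
```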

Rather than map the HTML entity, why not use [`unidecode`](https://pypi.python.org/pypi/Unidecode) here? It'll map quotes to ASCII equivalents for you. — Martijn Pieters, Jan 12 '15 at 12:35
So you mean use a third-party library? I was hesitant to do that if there is a stdlib solution. Thanks for the pointer, though; it's worth checking. — Nishant, Jan 12 '15 at 12:43
Possible duplicate of [Where is Python's "best ASCII for this Unicode" database?](http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database) — Nishant, Jan 13 '15 at 11:02

0 Answers