
I need to unescape some HTML entities from a string.

However, for a few characters such as ‘ ’ “ ”, I would like to substitute the nearest ASCII equivalent. All other non-ASCII characters should be stripped.

How can I do that in Python? I tried the following snippet, but it doesn't produce the "nearest" character the way I want.

import HTMLParser
import unicodedata
parser = HTMLParser.HTMLParser()
parsed = parser.unescape("&lsquo;")
nearest = unicodedata.normalize('NFKD', parsed).encode('ascii','ignore')

nearest is empty in the above code. Can I supply an argument to HTMLParser.unescape to convert it to ASCII quotes? I want to supply a custom mapping like this: {'&lsquo':'"','&rsquo':'"'}, where the items in the map are converted to ASCII.
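The empty result can be reproduced in isolation. The sketch below assumes Python 3 and that the entity in question unescapes to U+2018 (LEFT SINGLE QUOTATION MARK). That character has no Unicode decomposition, so NFKD leaves it unchanged and the `'ignore'` error handler then drops it. An accented letter is included only for contrast.

```python
import unicodedata

left_quote = "\u2018"  # LEFT SINGLE QUOTATION MARK, what &lsquo; unescapes to

# An empty decomposition string means NFKD has nothing to split this into.
print(repr(unicodedata.decomposition(left_quote)))  # ''

# NFKD leaves the quote as-is, so encoding to ASCII with 'ignore' drops it.
normalized = unicodedata.normalize("NFKD", left_quote)
nearest = normalized.encode("ascii", "ignore")
print(nearest)  # b''

# Contrast: U+00E9 decomposes into 'e' plus a combining accent,
# so the base letter survives the ASCII encoding step.
accented = unicodedata.normalize("NFKD", "\u00e9").encode("ascii", "ignore")
print(accented)  # b'e'
```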

xml.sax.saxutils has an API, unescape(html_text, entities={'&nbsp;': ' ', '&quot;': '"'}). Does HTMLParser have something similar?
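One stdlib-only way to get the kind of custom mapping described above might be to unescape first and then translate the resulting characters, rather than mapping entity names. This is a minimal sketch, not a confirmed answer. It assumes Python 3, where `html.unescape` takes the place of `HTMLParser.unescape`. The names `ASCII_OVERRIDES` and `unescape_to_ascii` are hypothetical, and the choice of replacement characters is only an example.

```python
import html

# Hypothetical mapping: characters that should become a specific ASCII
# character instead of being stripped. Extend as needed.
ASCII_OVERRIDES = {
    "\u2018": "'",   # left single quotation mark  (&lsquo;)
    "\u2019": "'",   # right single quotation mark (&rsquo;)
    "\u201c": '"',   # left double quotation mark  (&ldquo;)
    "\u201d": '"',   # right double quotation mark (&rdquo;)
}


def unescape_to_ascii(text, overrides=ASCII_OVERRIDES):
    """Unescape HTML entities, apply the overrides, strip other non-ASCII."""
    unescaped = html.unescape(text)
    # str.translate expects code points (ints) as keys.
    table = {ord(char): replacement for char, replacement in overrides.items()}
    replaced = unescaped.translate(table)
    # Anything still outside ASCII is dropped by the 'ignore' handler.
    return replaced.encode("ascii", "ignore").decode("ascii")


print(unescape_to_ascii("&lsquo;quoted&rsquo; &ldquo;text&rdquo; &copy;"))
```

Mapping characters after unescaping means the table does not need one entry per spelling of the same character (named entity, numeric reference, or literal).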

Nishant
  • Rather than map the HTML entity, why not use [`unidecode`](https://pypi.python.org/pypi/Unidecode) here? It'll map quotes to ASCII equivalents for you. – Martijn Pieters Jan 12 '15 at 12:35
  • So you mean use a third-party library? I was hesitant to do that if there is a stdlib solution. Thanks for the pointer, though; it's worth checking. – Nishant Jan 12 '15 at 12:43
  • Possible duplicate of [Where is Python's "best ASCII for this Unicode" database?](http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database) – Nishant Jan 13 '15 at 11:02

0 Answers