HTML Entity Codes to Text

Question

Does anyone know an easy way in Python to convert a string with HTML entity codes (e.g. &lt; &amp;) to a normal string (e.g. < &)?

cgi.escape() will escape strings (poorly), but there is no unescape().

score 45 · Accepted Answer · edited Jan 30 '18 at 16:15

HTMLParser has the functionality in the standard library. It is, unfortunately, undocumented:

(Python 2 Docs)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

(Python 3 Docs)

>>> import html  # html.unescape() is documented since Python 3.4; HTMLParser.unescape() was removed in 3.9
>>> html.unescape('alpha &lt; &beta;')
'alpha < β'

htmlentitydefs is documented, but requires you to do a lot of the work yourself.
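To give an idea of the "work yourself" part, here is a minimal sketch using the Python 3 name of that module, html.entities. The helper name unescape_entities and the regular expression are illustrative assumptions, not part of the standard library, and the sketch does not guard against out-of-range numeric references.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches &name;, decimal &#233; and hexadecimal &#x3b3; references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Hypothetical helper: decode entity references, leaving unknown names alone."""
    def replace(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))      # hexadecimal character reference
        if body.startswith("#"):
            return chr(int(body[1:]))          # decimal character reference
        if body in name2codepoint:
            return chr(name2codepoint[body])   # named entity
        return match.group(0)                  # unknown name: keep the original text
    return _ENTITY_RE.sub(replace, text)


print(unescape_entities("alpha &lt; &beta; &#233; &#x3b3; &bogus;"))
# alpha < β é γ &bogus;
```

Because the substitution is a single pass, "&amp;lt;" comes out as "&lt;" and is not decoded a second time.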

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
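The plain string replace could look roughly like this. The function name unescape_xml_predefined is made up for illustration; the one detail worth getting right is the order, with &amp; handled last.

```python
def unescape_xml_predefined(text):
    """Hypothetical helper: undo only the five XML predefined entities."""
    for entity, char in (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&apos;", "'")):
        text = text.replace(entity, char)
    # Replace &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
    return text.replace("&amp;", "&")


print(unescape_xml_predefined("a &lt; b &amp; c"))
# a < b & c
```

Numeric character references such as &#233; are left untouched by this approach.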

edited Jan 30 '18 at 16:15

Stefan Collier

answered Mar 19 '09 at 17:20

bobince

1

+1 I didn't know that function of HTMLParser – vartec Mar 19 '09 at 17:48
2

Here's a documented function from the standard library that will convert escaped HTML code to a normal string: http://docs.python.org/library/xml.sax.utils.html#xml.sax.saxutils.unescape – Steven T. Snyder Nov 29 '11 at 00:06
In Python 3.4, it was [documented](https://docs.python.org/3/library/html.html#html.unescape). – 9000 Aug 29 '17 at 21:15
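For reference, a short sketch of the xml.sax.saxutils.unescape function mentioned in the comments. By default it only handles &amp;, &lt; and &gt;; anything else has to be supplied through its optional entities dictionary, so it is not a full HTML unescaper.

```python
from xml.sax.saxutils import unescape

# Default behaviour: only &amp;, &lt; and &gt; are converted.
print(unescape("alpha &lt; beta &amp; gamma"))
# alpha < beta & gamma

# Additional entities can be passed as a mapping; &beta; is not in it and stays as-is.
print(unescape("&quot;hi&quot; &beta;", {"&quot;": '"'}))
# "hi" &beta;
```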

score 12 · Answer 2 · answered Mar 19 '09 at 17:45

I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

does it exactly as I was hoping.

answered Mar 19 '09 at 17:45

tghw

4

This only works for BeautifulSoup versions pre-BS4. If you are using BS4, you must use a formatter: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters – kronion Jun 16 '13 at 18:03
2

it does not work for &amp; Actually, if a string contains '&', BeautifulSoup converts it back to &amp;, which is the opposite of what I was hoping. – Dennis Golomazov Sep 13 '13 at 14:55

score 1 · Answer 3 · answered Mar 19 '09 at 17:03

There is nothing built into the Python stdlib to unescape HTML, but there's a short script you can tailor to your needs at http://www.w3.org/QA/2008/04/unescape-html-entities-python.html.

answered Mar 19 '09 at 17:03

Benjamin Pollack

There is a thing built into the Python stdlib to unescape HTML. See the accepted answer and edit your answer please. – Ekrem Dinçel Jul 28 '20 at 19:29

score 1 · Answer 4 · answered Mar 19 '09 at 17:22

Use the htmlentitydefs module. This is my old code; it worked, but I'm sure there is a cleaner and more Pythonic way to do it:

import htmlentitydefs  # Python 2; the module is html.entities in Python 3
e2c = dict(('&%s;' % k, unichr(v)) for k, v in htmlentitydefs.name2codepoint.items())
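A possible Python 3 port of that mapping, together with one way to apply it. The function replace_named is a hypothetical helper added for illustration; it covers named entities only and leaves numeric references and unknown names untouched.

```python
import re
from html.entities import name2codepoint

# Python 3 port of the e2c mapping: "&name;" -> character, with no eval needed.
e2c = {"&%s;" % name: chr(codepoint) for name, codepoint in name2codepoint.items()}


def replace_named(text):
    """Hypothetical helper: substitute known named entities in a single pass."""
    return re.sub(
        r"&[A-Za-z][A-Za-z0-9]*;",
        lambda m: e2c.get(m.group(0), m.group(0)),  # unknown names are kept as-is
        text,
    )


print(replace_named("alpha &lt; &beta;"))
# alpha < β
```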

edited Mar 19 '09 at 17:52

answered Mar 19 '09 at 17:22

vartec
