How do you convert HTML entities to Unicode and vice versa in Python?
-
@Jarret Hardie: Actually, show-and-tell is perfectly fine on SO. From the first entry on the FAQ (http://stackoverflow.com/faq) "It's also perfectly fine to ask and answer your own programming question". Although, it's also encouraged to look for duplicates as well. – chauncey Mar 31 '09 at 16:13
-
I am posting questions that I have answered for myself in the past for the benefit of other users searching for similar answers. – hekevintran Mar 31 '09 at 16:25
-
+1 He is contributing to the dataset. – Ryan Townshend Apr 02 '09 at 18:30
-
This question is wider in scope than the one pointed to by the "duplicate" link: this question also asks for "vice versa", i.e., from Unicode to HTML entities. – Vebjorn Ljosa Sep 24 '09 at 10:52
-
Can also be done without external libraries. See http://stackoverflow.com/questions/663058/html-entity-codes-to-text/663128#663128 – bobince Mar 31 '09 at 16:31
9 Answers
As to the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and subsequently another site which had the answer):
u'some string'.encode('ascii', 'xmlcharrefreplace')
will return a plain (byte) string with any non-ASCII characters turned into XML (HTML) numeric character references such as &#8217;.
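For Python 3, where every `str` is already Unicode, the same idea might look like the sketch below. The helper name `to_char_refs` is made up for illustration; the trailing `.decode('ascii')` only turns the resulting bytes back into a `str`.

```python
def to_char_refs(text):
    """Return text with each non-ASCII character replaced by a numeric character reference."""
    # 'xmlcharrefreplace' is a standard codec error handler: characters that
    # cannot be encoded as ASCII are written as &#NNN; instead of raising.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_char_refs('caf\u00e9 \u20ac5'))  # prints: caf&#233; &#8364;5
```

Note that this leaves ASCII markup characters such as `&` and `<` untouched; it only deals with non-ASCII characters.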

-
I've forgotten about xmlcharrefreplace and this was very helpful. Any time I need to safely store encoded or non-ascii characters to mysql I find I need to use this method. – cybertoast Feb 02 '12 at 20:36
-
This doesn't work with a string literal containing the unicode character U+2019 HTML entity equivalent &#8217; Isn't this what the question was asking for (this answer converts ascii which is a subset of unicode)? text.decode('utf-8').encode('ascii', 'xmlcharrefreplace') – Mike S Jul 07 '14 at 20:26
-
@MikeS It works without problem; `>>> u'\u2019'.encode('utf-8').decode('utf-8').encode('ascii', 'xmlcharrefreplace')` gives `'&#8217;'` – Piotr Dobrogost Jun 06 '16 at 11:46
You need to have BeautifulSoup (version 3, running under Python 2).
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

-
The BeautifulSoup api has changed. Please see the most recent [doc](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). – scharfmn Mar 03 '15 at 06:22
-
@hekevintran: Is it possible to print '&cent;, &pound;, &yen;, &euro;, &sect;, &copy;' instead of '&#162;, &#163;, &#165;, &#8364;, &#167;, &#169;'. Any idea? – Jagath Aug 05 '16 at 07:49
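One possible way to get named entities rather than numeric ones, sketched for Python 3 with the standard library table `html.entities.codepoint2name` (the HTML 4 entity names). The function name `to_named_entities` is hypothetical, and the fallback to numeric references for characters without a name is an assumption about what is wanted here.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace characters by named entities where a name exists, else by numeric references."""
    parts = []
    for ch in text:
        code = ord(ch)
        name = codepoint2name.get(code)
        if name is not None:
            # Covers '&', '<', '>', '"' as well as symbols such as the euro sign.
            parts.append('&%s;' % name)
        elif code > 127:
            # No named entity known for this character: fall back to a numeric reference.
            parts.append('&#%d;' % code)
        else:
            parts.append(ch)
    return ''.join(parts)


print(to_named_entities('\u00a2, \u00a3, \u00a5, \u20ac, \u00a7, \u00a9'))
# prints: &cent;, &pound;, &yen;, &euro;, &sect;, &copy;
```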
-
Update for Python 2.7 and BeautifulSoup4
Unescape -- Unicode HTML to unicode with htmlparser
(Python 2.7 standard lib):
>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape -- Unicode HTML to unicode with bs4
(BeautifulSoup4):
>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape -- Unicode to unicode HTML with bs4
(BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'

-
upvote for showing a standard library solution with no dependencies – Hartley Brody Jul 21 '16 at 15:58
-
Revisiting I just saw the comment @bobince left on the question pointing to [this answer](http://stackoverflow.com/a/663128/1599229). Since `htmlparser` is documented now, and since that comment is not prominent, leaving that part of answer. – scharfmn Jul 21 '16 at 17:02
As hekevintran's answer suggests, you may use `cgi.escape(s)` for encoding strings, but note that its `quote` argument is `False` by default, so it may be a good idea to pass the `quote=True` keyword argument alongside your string. Even with `quote=True`, the function won't escape single quotes (`'`). Because of these issues, the function has been deprecated since version 3.2 (and it was removed in Python 3.8).
It's been suggested to use `html.escape(s)` instead of `cgi.escape(s)` (new in version 3.2). It escapes both double and single quotes by default.
Also, `html.unescape(s)` has been introduced in version 3.4.
So in Python 3.4 and later you can:
- Use `html.escape(text).encode('ascii', 'xmlcharrefreplace').decode()` to convert special characters to HTML entities.
- Use `html.unescape(text)` to convert HTML entities back to plain-text representations.
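A minimal round-trip sketch of the two calls above, assuming Python 3.4 or later. The wrapper names `unicode_to_html_entities` and `html_entities_to_unicode` are only illustrative (borrowed from the older answer), not part of any library.

```python
import html


def unicode_to_html_entities(text):
    """Escape markup characters, then turn non-ASCII characters into numeric references."""
    escaped = html.escape(text)  # quote=True by default: escapes & < > " and '
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')


def html_entities_to_unicode(text):
    """Turn named and numeric character references back into characters."""
    return html.unescape(text)


sample = 'Tom & "Jerry" <3 caf\u00e9'
encoded = unicode_to_html_entities(sample)
print(encoded)  # prints: Tom &amp; &quot;Jerry&quot; &lt;3 caf&#233;
print(html_entities_to_unicode(encoded) == sample)  # prints: True
```

Passing `quote=False` to `html.escape` leaves both kinds of quotes alone, which may be preferable for text that never ends up inside an attribute value.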

For Python 3, use html.unescape():
import html
s = "&amp;"
u = html.unescape(s)
# &

$ python3 -c "
> import html
> print(
> html.unescape('&amp;&copy;&#x2014;')
> )"
&©—
$ python3 -c "
> import html
> print(
> html.escape('&©—')
> )"
&amp;©—
$ python2 -c "
> from HTMLParser import HTMLParser
> print(
> HTMLParser().unescape('&amp;&copy;&#x2014;')
> )"
&©—
$ python2 -c "
> import cgi
> print(
> cgi.escape('&©—')
> )"
&amp;©—
HTML only strictly requires `&` (ampersand) and `<` (left angle bracket / less-than sign) to be escaped. https://html.spec.whatwg.org/multipage/parsing.html#data-state
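As an illustration of that minimal rule, here is a sketch of an escaper that touches only those two characters. The name `escape_minimal` is hypothetical, and the sketch is meant for text content only: inside an attribute value the delimiting quote character would need escaping as well.

```python
import html


def escape_minimal(text):
    """Escape only '&' and '<', the two characters discussed above."""
    # The ampersand must be handled first; otherwise the '&' of a freshly
    # inserted '&lt;' would itself be escaped a second time.
    return text.replace('&', '&amp;').replace('<', '&lt;')


print(escape_minimal('a < b && c > d'))  # prints: a &lt; b &amp;&amp; c > d
print(html.unescape(escape_minimal('a < b && c > d')))  # prints: a < b && c > d
```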

If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) or &#128; (for the euro symbol)
are not converted properly, the reason is that ISO-8859-1 defines no printable characters at those code points (128 to 159). The symbols sit at those positions only in Windows-1252, which is often confused with ISO-8859-1.
Also note that the default character set as of HTML5 is UTF-8; it was ISO-8859-1 for HTML4.
So we will have to work around this somehow (for example, find and replace those references first).
Reference (starting point) from Mozilla's documentation
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
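In Python 3, `html.unescape` is documented to follow the HTML5 rules for invalid character references, so it maps these Windows-1252 style codes itself. The sketch below shows that, plus a manual equivalent for code that assembles characters by hand; `fix_cp1252_ref` is a hypothetical helper name, and replacing the few positions Windows-1252 leaves undefined with U+FFFD is an assumption.

```python
import html

# The standard library applies the HTML5 remapping for codes 128 to 159.
trademark = html.unescape('&#153;')
euro = html.unescape('&#128;')
print(trademark, euro)  # prints: ™ €


def fix_cp1252_ref(code):
    """Map a numeric reference to a character, treating 128-159 as Windows-1252 bytes."""
    if 128 <= code <= 159:
        # Positions that Windows-1252 leaves undefined become U+FFFD here.
        return bytes([code]).decode('cp1252', errors='replace')
    return chr(code)


print(fix_cp1252_ref(153), fix_cp1252_ref(128))  # prints: ™ €
```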

I used the following function to convert unicode ripped from an xls file into an html file while conserving the special characters found in the xls file:
import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is nul it is replaced with non breakable space
        . non-ascii characters are translated to xml
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII text is written unchanged
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # escape markup characters, then turn non-ASCII into numeric references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
hope this is useful to somebody
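One thing to be aware of with the function above: text that is pure ASCII is written as-is, so a cell such as `a < b` ends up unescaped. A possible variant that escapes in every case is sketched here; the name `html_wr_escaped` is made up, and writing to an in-memory `io.BytesIO` merely stands in for a file opened in binary mode.

```python
import html
import io


def html_wr_escaped(f, dat):
    """Write dat to binary file f as HTML, escaping markup and non-ASCII characters."""
    if not dat:
        # Empty cells become a non-breaking space, as in the original function.
        f.write(b'&nbsp;')
        return
    f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))


buf = io.BytesIO()
for cell in ['a < b', '', 'Cur\u00e9']:
    html_wr_escaped(buf, cell)
    buf.write(b'\n')
result = buf.getvalue()
print(result)  # prints: b'a &lt; b\n&nbsp;\nCur&#233;\n'
```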

#!/usr/bin/env python3
import fileinput
import html
for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))
