Python, convert HTML entities to Unicode

Question

(Edit: I'm using Python 2.7) (Edit 2: I have already checked Convert XML/HTML Entities into Unicode String in Python; the solutions there do not work. Please do not flag this as already answered.)

I've been unable to find a Python package that reliably converts text containing HTML entities to Unicode. HTMLParser works for some inputs but breaks on others, and BeautifulSoup never seems to produce the converted Unicode. How can I return a Unicode representation of strings a-d using only one method?

I think the problem is that some of my text contains both Unicode characters and HTML entities (as in example string d).

import HTMLParser
from bs4 import BeautifulSoup

astring = "P&amp;O."
bstring = "&amp; "
cstring = "&gt;"
dstring = "&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup(astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup(bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup(cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup(dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

which gives the following output:

a1: P&amp;O.
a2: P&O.
b1: &amp; 
b2: & 
c1: &gt;
c2: >
d1: &gt; 150ÎC
d2: HTML Parse Broke!

Edit 3: kalhartt's suggestion led me to a solution. To keep the strings with mixed character encoding from breaking, I used .decode('utf-8').
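
A minimal sketch of that approach, assuming the input bytes are UTF-8 encoded. The helper name `to_unicode_text` is hypothetical, and the import fallback is only there so the same code runs on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape              # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2.7
    unescape = HTMLParser().unescape


def to_unicode_text(raw, encoding="utf-8"):
    """Return `raw` as unicode text with HTML entities converted.

    Hypothetical helper; the UTF-8 default is an assumption about how
    the input bytes were encoded.
    """
    if isinstance(raw, bytes):
        # Decode first, so unescape never mixes byte strings with unicode.
        raw = raw.decode(encoding)
    return unescape(raw)


# The last sample holds the same characters as dstring, as UTF-8 bytes.
samples = (b"P&amp;O.", b"&amp; ", b"&gt;", u"&gt; 150\u00ceC".encode("utf-8"))
for sample in samples:
    print(repr(to_unicode_text(sample)))
```

Input that is already a unicode string skips the decode step and is only unescaped, so the helper can be called on either kind of string.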

Have you [read this](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python)? — Tim Peters, Oct 08 '13 at 02:24
There are 7 answers in the message I referenced. Did you try all of them? ;-) — Tim Peters, Oct 08 '13 at 02:35

score 1 · Accepted Answer · answered Oct 08 '13 at 03:24

If you want unicode handling, use unicode strings. Everything works as expected in your example then.

# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup

astring = u"P&amp;O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output.

a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: & 
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC

BeautifulSoup encodes the entities when it prints the markup, while HTMLParser decodes them.
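
The two directions can be illustrated with the standard library alone. This is only a sketch, not what BeautifulSoup does internally: `escape` stands in for the entity encoding applied when markup is written out, and `unescape` for the decoding step. The import fallback is an assumption added so the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from html import escape, unescape      # Python 3.4+
except ImportError:
    from cgi import escape                 # Python 2.7
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

text = u"P&O. > 150"
markup = escape(text)        # text -> entities (the encoding direction)
restored = unescape(markup)  # entities -> text (the decoding direction)

print(markup)
print(restored)
```

If the goal is plain unicode text, the decoding direction is the one to use.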
