(Edit: I'm using Python 2.7.) (Edit 2: I have already checked "Convert XML/HTML Entities into Unicode String in Python"; the solutions there do not work for me. Please do not flag this as already answered.)
I've been unable to find a Python package that reliably converts text containing HTML entities to unicode. HTMLParser works for some inputs but breaks on others, and BeautifulSoup never seems to give me unicode. How can I get a unicode representation of strings a-d below using a single method?
I think the problem is that some of my text contains both Unicode characters and HTML entities (as in example string d).
import HTMLParser
from bs4 import BeautifulSoup
astring = "P&amp;O."
bstring = "&amp; "
cstring = "&gt;"
dstring = "&gt; 150ÎC"
pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup(astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup(bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup(cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup(dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2
This gives the following output:
a1: P&O.
a2: P&O.
b1: &
b2: &
c1: >
c2: >
d1: > 150ÎC
d2: HTML Parse Broke!
Edit 3: kalhartt's suggestion led me to a solution. To keep the strings that mix non-ASCII characters and HTML entities from breaking, I call .decode('utf-8') on the byte string before unescaping it.
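For anyone landing here, this is roughly what that fix looks like as a single function. It is only a sketch: the helper name `to_unicode` and the `samples` list are mine, I'm assuming the input bytes are UTF-8 encoded, and I'm assuming the test strings carry their entities in the `&amp;` / `&gt;` form. The `html` import is just a fallback so the same snippet also runs on Python 3; on 2.7 it uses `HTMLParser` as in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape


def to_unicode(raw, encoding="utf-8"):
    """Decode a byte string to unicode first, then expand HTML entities.

    Hypothetical helper; `encoding` is an assumption about the input.
    """
    if isinstance(raw, bytes):  # on Python 2, `bytes` is an alias for `str`
        raw = raw.decode(encoding)
    return unescape(raw)


# Byte strings a-d; \xc3\x8e is the UTF-8 encoding of the character Î.
samples = [b"P&amp;O.", b"&amp; ", b"&gt;", b"&gt; 150\xc3\x8eC"]
for raw in samples:
    print(repr(to_unicode(raw)))
```

Decoding first means `unescape` only ever sees unicode, so it never has to mix a non-ASCII byte string with the unicode characters it substitutes for the entities, which appears to be what broke string d.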