How to deal with utf-8 encoded String and BeautifulSoup?

Question

How can I replace HTML-entities in unicode-Strings with proper unicode?

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

to

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

edit
Actually the entities are wrong. It seems like BeautifulSoup f...ed it up.

So the question is: How to deal with utf-8 encoded String and BeautifulSoup?

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
            l += [r.string] # here things seem to go wrong
    allArticles+=[l]

Ü becomes &Yuml; instead of staying Ü, but actually I don't want the encoding to be changed at all.

>>> soup.originalEncoding
'utf-8'

but I can't generate a proper unicode string from it

possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) — Wooble, Oct 29 '10 at 18:02
Things seem to go wrong? BeautifulSoup f'ed it up? The entities are wrong? Please try to give more precise details to make this question answerable. BeautifulSoup tends to handle UTF-8 pretty well. — Josh Lee, Oct 29 '10 at 18:20

towi · Answer 1 · 2010-10-29T18:15:33.037

I think what you need are ICU transliterators; there should be a way to transliterate HTML entities into Unicode.

Try the transliterator ID Hex/XML-Any; that should do what you want. On the demo page you can choose "Insert Sample: Compound", enter Hex/XML-Any into the "Compound 1" box, add some input data in the box and press "transform". Does this help?

There is a Python ICU binding, but I think it's not well maintained.

score 1 · Answer 2 · answered Oct 29 '10 at 18:24

htmlentitydefs.entitydefs["quot"] returns '"'
That's a dictionary that translates entities to their actual character.
You should be able to continue easily from that point.
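
For readers on Python 3, a minimal sketch of the same idea: there the `htmlentitydefs` module is named `html.entities`, and `html.unescape` applies the entity table to a whole string. The sample text is shortened from the question; this only helps when the entities in the input are the right ones to begin with.

```python
import html
from html.entities import name2codepoint

# name2codepoint maps an entity name to its Unicode code point,
# so chr() turns the lookup into the actual character.
quot = chr(name2codepoint["quot"])

# html.unescape replaces every named or numeric entity in one pass.
text = "&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden"
decoded = html.unescape(text)

print(quot)
print(decoded)
```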

BlueTrance

That would work if BeautifulSoup gave me the right entities at all. See my edit. – vikingosegundo Oct 29 '10 at 18:25

score 0 · Accepted Answer · answered Oct 29 '10 at 19:24

OK, the problem was silly, I have to confess. I was working with an old version of rows in the interactive interpreter. I don't know what was wrong with its contents, but this is the correct code:

from BeautifulSoup import BeautifulSoup

f = open('path_to_file','r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')  # the line missing from the question: rebuild rows from the current soup
allArticles = []
for row in rows:
    l =[]
    for r in row.findAll('td'):
        l += [r.string]
    allArticles+=[l]

shame on me!
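
For comparison, a rough sketch of the same row and cell collection using only the Python 3 standard library (`html.parser`) instead of BeautifulSoup. This is a swapped-in technique, not the code from the answer: `TableCellCollector`, `extract_rows` and the sample markup are hypothetical names and data made up for illustration. It assumes the input bytes really are UTF-8, decodes them once up front, and lets the parser turn entities into characters.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by its enclosing <tr>."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently open, else None
        self._cell = None   # text pieces of the <td> currently open, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_bytes, encoding="utf-8"):
    """Decode the bytes once, then return a list of rows of cell text."""
    parser = TableCellCollector()
    parser.feed(html_bytes.decode(encoding))
    parser.close()
    return parser.rows


# Hypothetical sample input standing in for the file from the question
sample = (
    "<table><tr><td>&quot;HAUS Kleider&quot;</td>"
    "<td>Über das Verhüllen</td></tr></table>"
).encode("utf-8")
all_articles = extract_rows(sample)
print(all_articles)
```

As in the answer, the rows are rebuilt from the freshly parsed input on every call, so there is no stale `rows` variable to trip over.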
