I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.
However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.
Sample program:
import urllib2
from BeautifulSoup import BeautifulSoup
# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
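# Note: Accept-Encoding negotiates content codings such as gzip;
# the header for requesting a character set is Accept-Charset
request.add_header('Accept-Encoding', 'utf-8')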
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
# Parse with BeautifulSoup
soup = BeautifulSoup(response)
# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])
Running this gives the result:
# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'
But I would expect a Python Unicode string to render ö in the word können as \xf6:
# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
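For what it's worth, the escapes in the actual output look consistent with the UTF-8 bytes 0xc3 0xb6 having been decoded with a single-byte Latin-2 codec. The sketch below only checks that mapping; ISO-8859-2 is my guess at the codec involved, not something BeautifulSoup reports.

```python
# The two bytes that the page uses for the o-umlaut.
raw = b'\xc3\xb6'

# Decoded as UTF-8, they form the single code point U+00F6.
as_utf8 = raw.decode('utf-8')

# Decoded as ISO-8859-2 (an assumption), each byte becomes its own
# character: 0xC3 -> U+0102 and 0xB6 -> U+015B.
as_latin2 = raw.decode('iso-8859-2')

print(repr(as_utf8))
print(repr(as_latin2))
```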
I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and I've tried calling read() and decode() on the response object, but it either makes no difference or throws an error.
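For reference, the read-and-decode attempt could be written roughly as below. This is only a sketch: decode_body is a hypothetical helper name, and falling back to errors='replace' is an assumption about how to cope with stray bytes that are not valid UTF-8, not something the page is known to require.

```python
def decode_body(raw_bytes, encoding='utf-8'):
    """Decode a response body to a Unicode string (hypothetical helper)."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte.
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Assumed fallback: substitute U+FFFD for undecodable bytes.
        return raw_bytes.decode(encoding, 'replace')


# In the sample program, raw_bytes would come from response.read().
clean = decode_body(b'Hier k\xc3\xb6nnen Sie')
broken = decode_body(b'Hier k\xc3\xb6nnen \xff Sie')
print(repr(clean))
print(repr(broken))
```

The resulting Unicode string could then be handed to BeautifulSoup in place of the response object.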
With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):
20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6 | title="Hier k..|
6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f |nnen Sie sich ko|
73 74 65 6e 6c 6f 73 20 72 65 67 69 73 74 72 69 |stenlos registri|
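The same check can be reproduced in Python by turning the hex columns of the dump back into bytes and decoding them strictly as UTF-8. This sketch uses only the first two rows shown above.

```python
import binascii

# Hex columns copied from the first two hexdump rows above.
hex_rows = [
    '20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6',
    '6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f',
]

# Strip the spaces and convert the hex digits back into raw bytes.
raw = binascii.unhexlify(''.join(hex_rows).replace(' ', ''))

# A strict decode raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode('utf-8')
print(repr(text))
```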
I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?