I'm trying to extract text and HTML from a website with Scandinavian characters using Beautiful Soup and Python 2.6.5.
    html = open('page.html', 'r').read()
    soup = BeautifulSoup(html)
    descriptions = soup.findAll(attrs={'class' : 'description' })
    for i in descriptions:
        description_html = i.a.__str__()
        description_text = i.a.text.__str__()
        description_html = description_html.replace("/subdir/", "http://www.domain.com/subdir/")
        print description_html
However, when executed, the program fails with the following error message:
    Traceback (most recent call last):
      File "test01.py", line 40, in <module>
        description_text = i.a.text.__str__()
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 19: ordinal not in range(128)
The input page seems to be encoded in ISO-8859-1, if that's any help. I tried setting the source encoding explicitly with BeautifulSoup(html, fromEncoding="latin-1"), but that didn't help either.
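For reference, the error seems to be reproducible without Beautiful Soup at all. In Python 2, calling `__str__()` on a unicode string encodes it with the default ASCII codec, which would explain why changing the input encoding makes no difference. Below is a minimal sketch of that behaviour; the sample string and the `to_ascii` helper are made up for illustration, and only the character u'\xe4' comes from the traceback.

```python
from __future__ import print_function

# Hypothetical sample text; u'\xe4' is the character named in the traceback.
text = u'p\xe4iv\xe4'


def to_ascii(value):
    """Encode with the ASCII codec, returning None if the codec fails."""
    try:
        return value.encode('ascii')
    except UnicodeEncodeError:
        return None


# ASCII cannot represent u'\xe4', so this mirrors the failing call.
ascii_bytes = to_ascii(text)

# Encoding explicitly with a codec that covers the character succeeds.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

print(repr(ascii_bytes))
print(repr(utf8_bytes))
print(repr(latin1_bytes))
```

If this is indeed the cause, replacing `i.a.text.__str__()` with something like `i.a.text.encode('utf-8')` in the loop may avoid the exception, though that has not been verified against the actual page.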
It's 2011 and I'm still wrestling with trivial character encoding problems. I believe there must be a really simple solution to all this.