I'm trying to extract text and HTML from a website with Scandinavian characters using Beautiful Soup and Python 2.6.5.

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

html = open('page.html', 'r').read()
soup = BeautifulSoup(html)

descriptions = soup.findAll(attrs={'class' : 'description' })

for i in descriptions:
    description_html = i.a.__str__()
    description_text = i.a.text.__str__()
    description_html = description_html.replace("/subdir/", "http://www.domain.com/subdir/")
    print description_html

However, when executed, the program fails with the following error message:

Traceback (most recent call last):
  File "test01.py", line 40, in <module>
    description_text = i.a.text.__str__()
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 19: ordinal not in range(128)

The input page seems to be encoded in ISO-8859-1, if that's any help. I tried setting the correct source encoding with BeautifulSoup(html, fromEncoding="latin-1") but it didn't help either.

It's 2011 and I'm still wrestling with trivial character encoding problems; I believe there's a really simple solution to all this.

agf

2 Answers

i.a.__str__('latin-1')

or

i.a.text.encode('latin-1')

should work.
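
To see what the explicit encoding changes, here is a minimal sketch in plain Python with no Beautiful Soup involved. The sample string is made up for illustration, and it only assumes that the text is already a Unicode string, as Beautiful Soup text is:

```python
# Hypothetical sample text containing the character from the traceback (u'\xe4').
text = u'p\xe4iv\xe4'

# An explicit latin-1 encoding maps each character to a single byte.
latin1_bytes = text.encode('latin-1')

# The ASCII codec has no byte for u'\xe4', so the same call fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The failing branch is what happens implicitly when __str__() is called without an encoding under Python 2.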

Are you sure it's latin-1? Beautiful Soup should detect the encoding correctly.

Also, why not just use str(i.a) if you don't need to specify an encoding?

Edit: Looks like you need to install chardet for it to automatically detect encodings.

agf

I was having the same problem with Beautiful Soup failing to output a line of text containing German characters. Unfortunately, a myriad of answers, even on Stack Overflow, didn't solve my problem. This was my code:

    title = str(link.contents[0].string)

This gave UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128).

Many answers do have valuable pointers as to a correct solution. As Lennart Regebro says at UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128):

When you do str(u'\u2013') you are trying to convert the Unicode string to an 8-bit string. To do this you need to use an encoding, a mapping from Unicode data to 8-bit data. What str() does is use the system default encoding, which under Python 2 is ASCII. ASCII contains only the first 128 code points of Unicode, that is \u0000 to \u007F. The result is that you get the above error: the ASCII codec just doesn't know what \u2013 is (it's a long dash, btw).
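
A small sketch of that boundary, in case it helps. The helper name ascii_encodable is my own, not something from Python or Beautiful Soup, and the explicit encode('ascii') call stands in for what str() does implicitly under Python 2:

```python
def ascii_encodable(text):
    """Return True if the Unicode string can be encoded with the ASCII codec."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

# \u007f is the last code point ASCII covers; \u0080 is the first it does not.
last_ascii = ascii_encodable(u'\u007f')
first_non_ascii = ascii_encodable(u'\u0080')

# The dash from the quoted error is far outside ASCII, but UTF-8 can encode it.
dash_ascii = ascii_encodable(u'\u2013')
dash_utf8 = u'\u2013'.encode('utf-8')
```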

For me, the fix was simply to stop using str() to convert a Beautiful Soup object to a string. Fiddling with the console's default output made no difference either.

    # title = str(link.contents[0].string)
    # should be
    title = link.contents[0].encode('utf-8')
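
The same idea without Beautiful Soup, as a rough sketch: decode the bytes once on input, work on Unicode, and encode explicitly on output. The byte string below is a made-up stand-in for an ISO-8859-1 page like the one in the question, and UTF-8 as the output encoding is my assumption:

```python
# Hypothetical ISO-8859-1 input bytes (\xe4 is the single-byte form of u'\xe4').
raw = b'<a href="/subdir/x">p\xe4iv\xe4</a>'

# Decode once, at the input boundary, to get a Unicode string.
html = raw.decode('latin-1')

# Work on Unicode only; no implicit ASCII conversion takes place here.
html = html.replace(u'/subdir/', u'http://www.domain.com/subdir/')

# Encode once, explicitly, at the output boundary.
out = html.encode('utf-8')
```

The result is a byte string, so it can be written to a file or terminal without going through the default ASCII codec.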
Hektor