Trouble with parsing HTML with unicodes through Beautiful Soup

Question

Beautiful Soup doesn't seem to work properly (for me) when the HTML contains non-ASCII characters, i.e. characters whose code point is 128 or above. What decoding or encoding should I use for this?

raw = open('index.html').read()
BeautifulSoup.BeautifulSoup(raw)

Error

...stacktrace... UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8094: ordinal not in range(128)

score 1 · Answer 1 · answered Oct 14 '11 at 15:24

The problem is not with parsing the file. With the link you gave in your comment to Marco, soup = BeautifulSoup(urllib.urlopen(your_link)) works absolutely fine.

The problem only appears when you try to print the parsed data to the console: it has now been converted to Unicode, and Python will try to output it as ASCII unless you tell it otherwise. So doing print soup rather than just soup in your console will work.
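The same failure can be reproduced without Beautiful Soup at all, which may help show that the parser is not at fault. This is only a minimal sketch: the string `sample` is a made-up stand-in for text taken from the parsed page, and `encode_for_output` is a hypothetical helper, not part of Beautiful Soup or the standard library.

```python
# -*- coding: utf-8 -*-

def encode_for_output(text, encoding="utf-8"):
    """Hypothetical helper: encode Unicode text with an explicit codec."""
    return text.encode(encoding)

# Stand-in for text from the parsed page; u'\xe9' is the character
# named in the traceback from the question.
sample = u"caf\xe9"

# Encoding to ASCII fails, because u'\xe9' is outside range(128).
try:
    sample.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and yields bytes that are safe to write out.
utf8_bytes = encode_for_output(sample)
```

Choosing the codec explicitly, rather than relying on the default one, is what avoids the error whenever the text has to leave the program as bytes.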

Daniel Roseman

how would you resolve this if you can't use the **print** statement? (see more here: http://stackoverflow.com/questions/7769745/python-convert-and-save-unicode-string-to-a-list) – Marco L. Oct 14 '11 at 16:01
You don't need to, that's the whole point. It's only a problem when you're outputting in the console. – Daniel Roseman Oct 14 '11 at 16:21
