How to Parse HTML with Non-ASCII Characters using BeautifulSoup?

Question

I keep getting the following error when trying to parse some html using BeautifulSoup:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

I've tried decoding the HTML using the solutions from the questions linked below, but none of them work and I keep getting the same error. (I'm listing them so that I don't get duplicate answers, and in case the related approaches help anyone else find a solution.)

Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?

EDIT: code and traceback below:

from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

Thanks for your help!

'ascii' codec error in beautifulsoup

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

How do I convert a file's format from Unicode to ASCII using Python?

python UnicodeEncodeError > How can I simply remove troubling unicode characters?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

I did not downvote. I just think that you should show **exactly what you did**, instead of just throwing a bunch of links at us. — mzjn, Jul 17 '11 at 18:50
Missing link: that of "some html". Also: what version of BS, what version of Python? Please show the traceback. — John Machin, Jul 17 '11 at 19:24
You are getting Unicode**De**codeError and you think that 3 links that mention Unicode**En**codeError will be helpful? — John Machin, Jul 17 '11 at 19:29
@rolling stone: links to UnicodeEncodeError are unhelpful because they are totally irrelevant to your problem. I *AM* trying to be helpful: asking you to supply basic facts that you should have supplied without being asked: traceback, versions, input data. More matter and less attitude, please. — John Machin, Jul 17 '11 at 20:03
@John Machin python version 2.5, BS version 3.0.4. Added traceback. I don't see how the HTML code is relevant or considered a missing link. It's a common problem to parse non-ASCII characters using BeautifulSoup (see http://www.crummy.com/software/BeautifulSoup/documentation.html#Why%20can%27t%20Beautiful%20Soup%20print%20out%20the%20non-ASCII%20characters%20I%20gave%20it?). In all the questions I've looked at on SO none of several dozen answerers have ever needed the exact HTML code to come up with a solution. — rolling stone, Jul 17 '11 at 20:07
@rolling stone: The problem may be that BS has guessed the encoding wrongly. This can happen with short data, for example. None of the never-needed-to-see-the-HTML answers have provided a solution to *your problem*, have they? — John Machin, Jul 17 '11 at 20:20
@rolling stone: How do you know what is the correct encoding to specify?? The first link that you gave suggested explicitly specifying the encoding ... you didn't try it?? — John Machin, Jul 17 '11 at 20:35

John Machin · Accepted Answer · score 2 · 2011-07-18T09:08:49.000

You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""

Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.
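To make the point about lead bytes concrete, here is a minimal sketch. It is written in Python 3 syntax (the thread itself uses Python 2.5), and the helper name `is_utf8_lead_byte` is made up for illustration. It classifies a byte by its value range and shows how ® is encoded in UTF-8 versus Latin-1.

```python
def is_utf8_lead_byte(byte):
    """Return True if `byte` (an int in 0..255) can start a UTF-8 sequence."""
    if byte < 0x80:
        return True               # 0xxxxxxx: single-byte ASCII
    if 0x80 <= byte <= 0xBF:
        return False              # 10xxxxxx: continuation byte only
    return 0xC2 <= byte <= 0xF4   # lead bytes of 2-, 3- and 4-byte sequences

# U+00AE is the registered sign (®).
registered_utf8 = "\u00ae".encode("utf-8")      # two bytes; 0xae is the second one
registered_latin1 = "\u00ae".encode("latin-1")  # a single 0xae byte

# A lone 0xae byte is not valid UTF-8 on its own.
try:
    b"\xae".decode("utf-8")
    lone_byte_decodes = True
except UnicodeDecodeError:
    lone_byte_decodes = False
```

So a 0xae byte reported at position 0 suggests that the string being decoded is not UTF-8 text at all, which is why the content-type check alone does not settle the question.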

Either divulge the link to your HTML, or do some basic debugging:

Does uc = html.decode('utf8') work or fail? If fail, with what error message?
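One way to run that check and capture a useful answer is sketched below. This is Python 3 syntax, not the Python 2.5 used in the thread; `try_decode` is a hypothetical helper and the sample byte strings are invented stand-ins for the real page.

```python
def try_decode(raw, encoding="utf-8"):
    """Decode `raw` bytes; return (text, None) on success or (None, message) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        bad = raw[exc.start:exc.end]
        # Include a little surrounding data so the offending spot can be found in the page.
        context = raw[max(0, exc.start - 20):exc.end + 20]
        message = "%s at offset %d: bad bytes %r in %r" % (
            exc.reason, exc.start, bad, context)
        return None, message

# Invented sample data: the same title encoded as UTF-8 and as Latin-1.
ok_text, ok_err = try_decode(b"Ultimate 81\xc2\xae")
bad_text, bad_err = try_decode(b"Ultimate 81\xae")
```

If the decode succeeds, as it turns out to in this thread, the encoding of the input is not the problem and the failure has to come from somewhere inside the parser.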

You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""

I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.

Update: This looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. The corner case is a numeric HTML character reference (such as `&#174;` for ®) in an attribute value AND with 128 <= ordinal < 255.
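The mechanism behind this corner case can be sketched as follows. This is an illustration under assumptions, not the actual sgmllib.py source: the function names and the `upper_limit` parameter are invented, and it is written in Python 3, where the implicit ASCII decode that Python 2 performs when a byte string meets unicode text has to be spelled out explicitly.

```python
def convert_charref(name, upper_limit):
    """Mimic a range check on a numeric char ref; None means 'leave the reference alone'."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= upper_limit:
        return None
    return bytes([n])  # stands in for Python 2's chr(n), which returns a byte string

def substitute_in_unicode_value(name, upper_limit):
    """Return the text that would replace '&#name;' inside a unicode attribute value."""
    replacement = convert_charref(name, upper_limit)
    if replacement is None:
        return "&#%s;" % name  # reference left untouched for later handling
    # Python 2 would decode the byte string as ASCII implicitly at this point.
    return replacement.decode("ascii")

# With a limit of 255, ordinal 174 becomes the byte 0xae and the ASCII decode fails.
try:
    substitute_in_unicode_value("174", 255)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# With a limit of 127, the reference is simply left in place.
patched_result = substitute_in_unicode_value("174", 127)
```

Under these assumptions, the failing case reproduces the same "can't decode byte 0xae in position 0" message as the traceback, while the lower limit avoids ever creating a non-ASCII byte string.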

Further comments: Rather than hacking your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch; BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.

edited Jul 18 '11 at 09:08

answered Jul 17 '11 at 21:49

John Machin


`uc = html.decode('utf8')` works, but `soup = BeautifulSoup(uc)` fails with the error message I added above. – rolling stone Jul 17 '11 at 22:42
also here's the part of the HTML that I think is causing the problem - the registered symbol in the title tag: `Onitsuka Tiger by Asics Ultimate 81® ` – rolling stone Jul 17 '11 at 22:43
I don't see anywhere in the docs where it says you can feed it a unicode object. Have you tried `soup = bs(html, fromEncoding='utf8')`? This is becoming tedious. Do you have a good reason for not divulging exactly which page on which shoe-shop website you are trying to scrape?? – John Machin Jul 18 '11 at 00:15
More questions: (1) How is that ® actually represented in the raw bytes version of your html: `'\xae'`? `'\xc2\xae'`? `'®'`? something else? (2) What happens when you do `html.decode('ascii')`? – John Machin Jul 18 '11 at 00:42
No reason at all - here's the link: http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81. Answers to your questions: (1) `®` and (2) I get the following error `Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1198: ordinal not in range(128)` – rolling stone Jul 18 '11 at 01:17
And I don't think you've found the part of the HTML that's causing the problem; according to the sgmllib.py code where it crashes, it's found a start-tag, then matched attributename = value, and it's having trouble with the value ... – John Machin Jul 18 '11 at 01:19
interesting...could very well be, was just making a guess. what's the value that it's having trouble with? – rolling stone Jul 18 '11 at 01:20
@rolling stone: Not a problem. I hope I helped you learn about providing as much useful information as possible up front when asking a question like that. Looking forward to the next time ... – John Machin Jul 18 '11 at 09:13
definitely learned the importance of doing that next time. thanks again for your help! – rolling stone Jul 18 '11 at 15:58

score 2 · Answer 2 · answered Jul 18 '11 at 03:35

I tried to use pyquery on the html and the result is fine.

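# Python 2 code (urllib.urlopen and the print statement)
import urllib
from pyquery import PyQuery

# Fetch the raw page bytes and let pyquery (lxml) do the parsing
html = urllib.urlopen('http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81').read()
pq = PyQuery(html)
print pq('span#price').text() # "$39.00 40% off MSRP $65.00"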

pyquery is based on lxml, so it's also much faster than BeautifulSoup.
