python UnicodeEncodeError > How can I simply remove troubling unicode characters?

Question

Here's what I did:

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troubling Unicode characters from the HTML?
Or is there a cleaner solution?

score 10 · Accepted Answer · edited Nov 01 '12 at 17:10

Try decoding the HTML first and ignoring any bytes that are not valid UTF-8:

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
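A minimal sketch of what the 'ignore' handler does, written for Python 3 (the thread itself uses Python 2, where the same call is made on a str). The sample bytes are hypothetical; 0xae is the byte that the decode error in this thread complains about.

```python
# Hypothetical sample: a lone 0xae byte is not valid UTF-8
# (it is the registered-trademark sign in Latin-1).
raw = b"amazon\xae.com"

dropped = raw.decode("utf-8", "ignore")    # invalid bytes are silently discarded
replaced = raw.decode("utf-8", "replace")  # invalid bytes become U+FFFD
latin1 = raw.decode("latin-1")             # keeps the character, if the page really is Latin-1

print(repr(dropped), repr(replaced), repr(latin1))
```

If the page declares its own charset, decoding with that charset will probably lose less information than 'ignore' does.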

edited Nov 01 '12 at 17:10

Jonas Byström

answered Mar 08 '11 at 18:46

esv

Didn't work! Here's what happened:
>>> html.decode('utf-8', 'strip')
Traceback (most recent call last):
  .....
LookupError: unknown error handler name 'strip'
>>> html.decode('utf-8')
Traceback (most recent call last):
  .....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 98071: unexpected code byte
– Nullpoet Mar 08 '11 at 19:04
I'm very sorry, it should be 'ignore' instead of 'strip'. I also recommend reading the Unicode HOWTO: http://docs.python.org/howto/unicode.html – esv Mar 08 '11 at 19:08
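A small sketch, assuming Python 3, of how to check whether an error handler name is registered before passing it to decode(). The helper name is_known_handler is made up for illustration; codecs.lookup_error is the standard-library call it wraps.

```python
import codecs

def is_known_handler(name):
    """Return True if `name` is a registered codec error handler."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        return False
    return True

# 'strip' is the name that failed in the comment above.
known = {name: is_known_handler(name)
         for name in ("strict", "ignore", "replace", "strip")}
print(known)
```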

jfs · Answer 2 · 2011-03-09T12:59:47.587

The error you see occurs because repr(soup) tries to mix Unicode and bytestrings. Mixing the two frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
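For comparison, a short sketch assuming Python 3 (not what this answer was written against): there the implicit ASCII conversion is gone, so mixing text and bytes fails with a TypeError right away, and the fix is an explicit decode.

```python
text = "1"
raw = "\u00a9".encode("utf-8")  # the copyright sign as UTF-8 bytes: b'\xc2\xa9'

try:
    combined = text + raw       # str + bytes is never converted implicitly
    error = None
except TypeError as exc:
    error = exc

# Decode explicitly so both operands are text.
combined = text + raw.decode("utf-8")
print(repr(combined))
```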

Here's an example for classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

A similar thing happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

To work around it:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'

score 1 · Answer 3 · edited May 23 '17 at 12:32

First of all, the "troubling" Unicode characters could be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert Unicode to ASCII. Check out the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?

The accepted answer there seems like a good solution (that I didn't know about beforehand).

edited May 23 '17 at 12:32

Community

answered Mar 08 '11 at 18:13

Karim

That solution isn't working for me because html is not unicode, it's just str:
>>> unicodedata.normalize('NFKD', html).encode('ascii','ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: normalize() argument 2 must be unicode, not str
– Nullpoet Mar 08 '11 at 18:29
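The TypeError arises because normalize() expects text, not raw bytes, so the bytes have to be decoded first. A minimal sketch assuming Python 3 and UTF-8 input; the helper name to_ascii and the sample string are made up for illustration.

```python
import unicodedata

def to_ascii(data, encoding="utf-8"):
    """Decode bytes if needed, decompose accented characters, drop anything non-ASCII."""
    text = data.decode(encoding, "ignore") if isinstance(data, bytes) else data
    decomposed = unicodedata.normalize("NFKD", text)  # e.g. e-acute -> 'e' + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical input: an accented letter plus the registered-trademark sign.
result = to_ascii("caf\u00e9 \u00ae".encode("utf-8"))
print(repr(result))
```

Accented letters keep their base letter this way, while symbols with no ASCII decomposition are dropped entirely.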

score 0 · Answer 4 · edited May 23 '17 at 11:51

I had the same problem and spent hours on it. Notice that the error occurs whenever the interpreter has to display content: the interpreter tries to convert the text to ASCII, which fails for non-ASCII characters. Take a look at the top answer here:

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
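The display-time failure can be reproduced without a terminal. The sketch below assumes Python 3 and uses an in-memory stream configured for ASCII as a stand-in for the interpreter's output; the helper name write_to_ascii_stream is hypothetical.

```python
import io

def write_to_ascii_stream(text, errors="strict"):
    """Write `text` to an in-memory ASCII-only stream and return the bytes written."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()

try:
    write_to_ascii_stream("amazon\u00ae")  # strict: the write itself raises
    failed = False
except UnicodeEncodeError:
    failed = True

# With a lenient error handler the same write succeeds.
safe = write_to_ascii_stream("amazon\u00ae", errors="backslashreplace")
print(failed, safe)
```

The data is fine in both cases; only the stream's encoding and error handler decide whether displaying it fails.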

edited May 23 '17 at 11:51

Community

answered Jan 02 '12 at 22:21

SnowFrogger

