UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

Question

I'm working on scraping Oregon Teacher License data for a project I'm doing. Here's my code:

educ_employ = tree.xpath('//tr[15]//td[@bgcolor="#A9EDFC"]//text()')
print educ_employ
#[u'Jefferson Middle School\xa0\xa0(2013 - 2014)']

I want to strip the "\xa0". This is my code:

educ_employ = ([s.strip('\xa0') for s in educ_employ])
print educ_employ
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

I tried this:

educ_employ = ([s.decode('utf-8').strip('\xa0') for s in educ_employ])
print educ_employ
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

And this:

import sys

reload(sys)
sys.setdefaultencoding('utf-8')

educ_employ = tree.xpath('//tr[15]//td[@bgcolor="#A9EDFC"]//text()')
educ_employ = ([s.decode('utf-8').strip('\xa0') for s in educ_employ])
print educ_employ
>>>

I didn't get an error with the last one but I also didn't get an output. I'm using Python 2.7. Does anyone know how to fix this?

see the "common idiom" in http://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3 — mpez0, Mar 18 '16 at 14:40
@mpez0 Sorry, I'm having trouble understanding what's going on. What does `codecs` represent? I'm fairly new to programming, so this is a little over my head. — otteheng, Mar 18 '16 at 14:50
@otteheng just ignore them :D however if you are not absolutely required to use Python 2, I recommend switching to Python 3, it makes dealing with text / unicode easier. — Antti Haapala -- Слава Україні, Mar 18 '16 at 14:52
The problem here (considering Python 2) doesn’t seem to be `s`, but `'\xa0'` itself. Try `.strip(u'\xa0')` instead. — Arĥimedeς ℳontegasppα ℭacilhας, Mar 18 '16 at 17:07
@AnttiHaapala did you look at the "common idiom" for python2 portion of my linked reply? — mpez0, Mar 21 '16 at 15:46

score 3 · Accepted Answer · answered Mar 18 '16 at 14:46

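You are mixing up unicode objects and str objects. The items in educ_employ are unicode objects, but '\xa0' is a str. When the two are combined, Python 2 implicitly decodes the str with the ascii codec, and the byte 0xa0 is outside the ASCII range, which raises the UnicodeDecodeError.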

Additionally, .strip() only removes characters from the beginning and end of the string, not the middle. Try .replace() instead.

Try:

educ_employ = [u'Jefferson Middle School\xa0\xa0(2013 - 2014)']
educ_employ = [s.replace(u'\xa0', u'') for s in educ_employ]
print educ_employ
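
A minimal sketch of the difference between .strip() and .replace(), using the sample value from the question. The name NBSP and the variant that keeps a single regular space are illustrative additions, not part of the answer above. The u'' literals also run unchanged on Python 3.3+.

```python
# NBSP is a hypothetical helper name introduced for this sketch.
NBSP = u'\xa0'  # U+00A0, the non-breaking space

# Sample value copied from the question.
educ_employ = [u'Jefferson Middle School\xa0\xa0(2013 - 2014)']

# strip() only trims the ends, so the two interior characters survive.
stripped = [s.strip(NBSP) for s in educ_employ]

# replace() removes every occurrence, wherever it appears.
removed = [s.replace(NBSP, u'') for s in educ_employ]

# Alternative (an assumption about the desired output): turn each
# non-breaking space into a regular space, then collapse runs of whitespace.
spaced = [u' '.join(s.replace(NBSP, u' ').split()) for s in educ_employ]

print(removed[0])
print(spaced[0])
```

Whether to drop the characters entirely or keep one space between the school name and the years depends on how the scraped value will be used later.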

Robᵩ

Thanks, that worked great. I actually tried `.replace` before but it didn't work. I noticed that you include a `u` in your code. Just curious what that means and why it makes a difference? – otteheng Mar 18 '16 at 14:58
String literals without the `u` are objects of type `str`. String literals with the `u` are objects of type `unicode`. The essence of my answer is "don't mix `str` and `unicode`"; adding `u` to the string literal demonstrates that. – Robᵩ Mar 18 '16 at 15:03
That makes a lot of sense. Thanks for the clarification! – otteheng Mar 18 '16 at 15:07
@otteheng again, this confusion does not exist on `python3` – Antti Haapala -- Слава Україні Mar 18 '16 at 15:29
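
The str/unicode distinction discussed in the comments above can be sketched roughly as follows. The b'' prefix is used so the snippet means the same thing on Python 2.7 (where b'\xa0' is a plain str) and Python 3 (where it is bytes). Decoding with latin-1 is an assumption made only because byte 0xa0 maps to U+00A0 in that encoding; real input should be decoded with whatever encoding it actually uses.

```python
# Text value, as returned for the scraped cell in the question.
text = u'Jefferson Middle School\xa0\xa0(2013 - 2014)'

# Byte string: this is what the literal '\xa0' (no u prefix) is in Python 2.
raw = b'\xa0'

# Byte 0xa0 is outside the ASCII range, so an ASCII decode fails. Python 2
# attempts this decode implicitly when a str is mixed with a unicode object.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly (latin-1 is an assumption, see the note above) gives a
# text value that can be combined safely with other text.
nbsp = raw.decode('latin-1')
cleaned = text.replace(nbsp, u'')

print(ascii_ok)
print(cleaned)
```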
