I'm working on scraping Oregon Teacher License data for a project I'm doing. Here's my code:

educ_employ = tree.xpath('//tr[15]//td[@bgcolor="#A9EDFC"]//text()')
print educ_employ
#[u'Jefferson Middle School\xa0\xa0(2013 - 2014)']

I want to strip out the "\xa0" characters. This is what I tried:

educ_employ = ([s.strip('\xa0') for s in educ_employ])
print educ_employ
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

I tried this:

educ_employ = ([s.decode('utf-8').strip('\xa0') for s in educ_employ])
print educ_employ
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

And this:

import sys

reload(sys)
sys.setdefaultencoding('utf-8')

educ_employ = tree.xpath('//tr[15]//td[@bgcolor="#A9EDFC"]//text()')
educ_employ = ([s.decode('utf-8').strip('\xa0') for s in educ_employ])
print educ_employ
>>>

I didn't get an error with the last one, but I also didn't get any output. I'm using Python 2.7. Does anyone know how to fix this?

otteheng

1 Answer

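You are mixing up unicode objects and str objects. Each element of educ_employ is a unicode object, but '\xa0' is a str. When Python 2 combines the two, it first tries to decode the str as ASCII, and the byte 0xa0 is outside the ASCII range, which is where the UnicodeDecodeError comes from.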
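To make the distinction concrete, here is a minimal sketch that keeps bytes and text separate by decoding exactly once. The `raw` value is a hypothetical stand-in for bytes read from a UTF-8 page (lxml already hands back unicode objects in the question, so no decode step is needed there). It should run on both Python 2.7 and Python 3.

```python
# Hypothetical raw bytes, as they might arrive from a UTF-8 encoded page.
# In UTF-8, a non-breaking space (U+00A0) is the two bytes 0xC2 0xA0.
raw = b'Jefferson Middle School\xc2\xa0\xc2\xa0(2013 - 2014)'

# Decode once at the boundary; from here on the value is Unicode text.
text = raw.decode('utf-8')

# u'\xa0' is Unicode text too, so it has the same type as `text`
# and no implicit ASCII decoding is triggered.
has_nbsp = u'\xa0' in text
nbsp_count = text.count(u'\xa0')

print(nbsp_count)
```

The point is only that both sides of every string operation have the same type; the variable names are illustrative.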

Additionally, .strip() only removes characters from the beginning and end of the string, not the middle. Try .replace() instead.
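A short sketch of the difference, using the string from the question. Replacing with a space rather than an empty string, and the split/join variant, are my own assumptions about what output is wanted; they are not part of the original code.

```python
text = u'Jefferson Middle School\xa0\xa0(2013 - 2014)'

# strip() only looks at the two ends, so the interior \xa0 characters survive.
stripped = text.strip(u'\xa0')

# It does work when the characters sit at the ends of the string.
padded = u'\xa0Jefferson\xa0'
trimmed = padded.strip(u'\xa0')   # u'Jefferson'

# replace() touches every occurrence, wherever it is.
replaced = text.replace(u'\xa0', u' ')

# Optional variant: split() with no arguments treats \xa0 as whitespace
# on unicode strings, so split/join collapses the run into one space.
collapsed = u' '.join(text.split())

print(collapsed)
```

Here `stripped` is identical to `text`, `replaced` contains two ordinary spaces where the non-breaking spaces were, and `collapsed` contains one.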

Try:

educ_employ = [u'Jefferson Middle School\xa0\xa0(2013 - 2014)']
educ_employ = [s.replace(u'\xa0', u'') for s in educ_employ]
print educ_employ
Robᵩ
  • Thanks that worked great. I actually tried `.replace` before but it didn't work. I noticed that you include a `u` in your code. Just curious what that means and why it makes a difference? – otteheng Mar 18 '16 at 14:58
  • String literals without the `u` are objects of type `str`. String literals with the `u` are objects of type `unicode`. The essence of my answer is "don't mix `str` and `unicode`"; adding `u` to the string literal demonstrates that. – Robᵩ Mar 18 '16 at 15:03
  • That makes a lot of sense. Thanks for the clarification! – otteheng Mar 18 '16 at 15:07
  • @otteheng again, this confusion does not exist on `python3` – Antti Haapala -- Слава Україні Mar 18 '16 at 15:29
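For anyone on Python 3, a rough sketch of why the confusion goes away (this block is Python 3 only, and `mixed_ok` is just an illustrative name). Plain string literals are already Unicode text, and mixing text with bytes fails with an explicit TypeError instead of an implicit ASCII decode.

```python
# Python 3 only: every plain literal is Unicode text, no u prefix needed.
text = 'Jefferson Middle School\xa0\xa0(2013 - 2014)'
cleaned = text.replace('\xa0', ' ')

# Mixing text and bytes is rejected outright rather than silently decoded.
try:
    text.replace(b'\xa0', b' ')
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(cleaned)
```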