'charmap' codec can't encode character '\xae' While Scraping a Webpage

Question

I am web scraping with Python using BeautifulSoup, and I am getting this error:

'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>

when scraping a webpage

This is my Python code:

hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried:  print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

Looks like this may be the issue: http://stackoverflow.com/a/4197411/2372812. You need to set the codec much earlier than in your provided `# Tried:` line. — Tom Dalton, Nov 07 '14 at 17:45

Irshad Bhat · Answer 1 · 2014-11-07T18:43:26.520

This problem usually occurs when you call .encode() on a byte string that is already encoded. Try decoding it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xae'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

While:

html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®

This succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported just 0.5 confidence that it is right (unsurprising for a one-character string). Replace it with the actual encoding of the byte string that .urlopen().read() returns for the content you retrieved.
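
For readers on Python 3, where bytes and str are separate types, the same decode-then-encode idea might look like the sketch below. The byte value, the codec names, and the safe_encode helper are assumptions for illustration and are not part of the original code; cp437 stands in for a console code page that has no ® character.

```python
# Python 3 sketch. The byte value and codec names are illustrative assumptions.
raw = b"\xae"  # stand-in for the bytes returned by a page fetch

# Decode with the page's real encoding (windows-1252 is only an example).
text = raw.decode("windows-1252")   # bytes -> str
utf8_bytes = text.encode("utf-8")   # str -> bytes


def safe_encode(s, codec):
    """Hypothetical helper: encode s, substituting '?' for characters
    that the target codec cannot represent."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return s.encode(codec, errors="replace")


# cp437 has no mapping for the registered sign, so the fallback is used.
console_bytes = safe_encode(text, "cp437")
```

Falling back to errors="replace" loses the character, so it only suits display output; for saved data, writing UTF-8 is usually the safer choice.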

Note that you have to use decode('latin-1'), not encode('latin-1'). — Irshad Bhat, Nov 07 '14 at 18:16
