Python scraping gives unicode error

Question

res = requests.get(self.urlBase)
soup = BeautifulSoup(html)
print soup.prettify()

gives the error:

'ascii' codec can't encode character u'\xa0' in position 10816: ordinal not in range(128)

I'm using Requests and BeautifulSoup4.

I assume it has to do with Unicode? Every example I have seen uses it this way without issues. I'm not sure why there's a problem with my encoding.

The content type is text/html; charset=UTF-8

You have an **encoding error**; you are most likely experiencing a problem with `print`, not BeautifulSoup. Are you redirecting the output or using an IDE console here? — Martijn Pieters, Aug 06 '14 at 16:14
Can you give an example url or html which would cause the error? — heinst, Aug 06 '14 at 16:17
Also, you store the response in `res`, but don't show us where you got `html` from. — Martijn Pieters, Aug 06 '14 at 16:17
[PrintFails](https://wiki.python.org/moin/PrintFails). Your scraping is fine but printing Unicode to the Windows console from apps like Python that use the C standard library IO just doesn't work. — bobince, Aug 06 '14 at 16:51
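As the comments above suggest, the failure can be reproduced with no scraping at all. The following is a minimal sketch in Python 3 syntax (an assumption; the question uses Python 2), and the sample string is hypothetical. It only shows that encoding U+00A0 with the ASCII codec raises the same kind of error, while UTF-8 handles it.

```python
# Minimal sketch (Python 3 syntax): the error comes from *encoding* text
# to ASCII, not from BeautifulSoup or requests.
text = "price:\xa0100"  # hypothetical sample containing U+00A0 (non-breaking space)

try:
    text.encode("ascii")  # what an ASCII-only output stream has to do
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc  # same family of error as in the question

# UTF-8 can represent the character, so this succeeds.
utf8_bytes = text.encode("utf-8")

print(ascii_error)
print(utf8_bytes)
```

If the same string prints fine in one terminal and fails in another, that points at the output encoding of the environment rather than at the scraped data.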

score 1 · Answer 1 · answered Aug 06 '14 at 16:13


Try

print soup.decode('utf-8', 'ignore').prettify()

This will parse the soup string, ignoring all the characters it cannot comprehend.

If you don't pass the 'ignore' parameter, it will throw an error when it encounters a non-ASCII character.

answered Aug 06 '14 at 16:13

cionescu


The OP has an **encoding** error, not problems decoding. – Martijn Pieters Aug 06 '14 at 16:15
The assumption that it is BeautifulSoup is natural, as the OP didn't include a traceback. But the error message is enlightening just the same. – Martijn Pieters Aug 06 '14 at 16:17
Thanks, that was close; I had to `encode` it apparently – Tjorriemorrie Aug 06 '14 at 20:42

score 1 · Answer 2 · edited May 23 '17 at 12:14

You are correct that this has to do with Unicode. Essentially, the error says that Python can't directly print some characters to the command line, in this case the character '\xa0', which is apparently the Latin-1 non-breaking space. For fixing this specific problem, see this link.

Edit: see comments below for more specific information regarding the print module, as well as a more thorough and complete description of what may be causing the problem.

Edit: This link mentions the same error, and a comment there says that the 'ascii' codec error is unique to Python 2.x and comes from the request and other urllib modules. This supports my earlier statement, although it is not exhaustively documented.

Now for some unsolicited advice: if the program involved is small, does not have many dependencies, and does not use libraries that only exist in Python 2, use Python 3. I started a web scraping project earlier this summer in Python 2.7 and ran into several Unicode decoding errors that I ultimately could not resolve, even when I used the decoding modules on the strings themselves.

I then stumbled across the fact that Python 3 was made specifically to fix what Guido van Rossum himself said was "breaking Python", uniting Unicode and strings once and for all.

The reason I asked whether your code is relatively small: I upgraded my whole script, which was about 400 lines, to Python 3 in a few minutes, especially since I had a good interpreter that told me which syntax issues would arise. There are a few differences, but not very many, and you will be happy that you did this.

Short-term fix: use the (limited) support Python 2 has for Unicode.

Long-term fix: Find a way to port to Python 3.

Edit: Because this code specifically refers to the print module, I retract my statements as I do not have enough specific experience in the print module to make test cases in both Python 2.x and 3.x stating that a switch to Python 3 will necessarily fix this. It would be worth a reply from the OP, however, to see if the issue is addressed.

Edit 2: To make matters even more inconclusive, I tried the following code in Python 2.7 and Python 3.4:

Python 2.7:

from bs4 import BeautifulSoup
soup = BeautifulSoup(u'string with "\xa0" character')
print soup.prettify()

Python 3.4:

from bs4 import BeautifulSoup
soup = BeautifulSoup('string with "\xa0" character')
print(soup.prettify())

Both versions produce the same expected output. Even removing the `u` prefix from the string does not affect Python 2.7's output. Further investigation is needed.

BeautifulSoup gives you Unicode output by default; the problem here is in the environment and Python cannot determine what encoding the environment requires, or has determined that ASCII is the correct codec for that environment. The only difference Python 3 would make is that a different default might be picked that could equally be wrong. — Martijn Pieters, Aug 06 '14 at 16:45
Without more context from the OP we don't know what line is producing the problem, nor do we know what terminal, console or IDE is being used, or if perhaps a redirection to a file is used. In other words, switching to Python 3 won't necessarily help here. — Martijn Pieters, Aug 06 '14 at 16:46
@MartijnPieters While I agree that it's true that there is no way at the moment to know for sure what is throwing the error, it's extremely likely that the error here is coming from Python in trying to handle a conversion from a `Unicode` to a `str` because the `ascii codec` is mentioned. Python 3 has no such problems because no such conversions are necessary (both are considered the same type), — punyidea, Aug 06 '14 at 16:51
Not so. Python 3 still has to encode unicode strings to byte strings for display on the console because to a stdio application the console/stdout stream is a byte stream. Consequently all the PrintFails problems of Python 2 are still there in Python 3. — bobince, Aug 06 '14 at 16:53
The transition of string types between Python 2 and Python 3 is deeply misunderstood. You still have byte strings and unicode strings just like before; the differences are just that (1) they are named differently (`str`/`unicode` -> `bytes`/`str`), (2) the default string literal gives you a unicode string, (3) more library interfaces use unicode strings, in places which are not by nature explicitly byte-based, and (4) there are fewer places where Python implicitly converts between bytes and unicode for you, potentially getting it wrong. — bobince, Aug 06 '14 at 16:58
@punyidea: no, you are misunderstanding how Python 3 handles text. As I stated, BeautifulSoup handles Unicode correctly. Python 3 still has a bytes type, and printing to the console *still* needs to encode the Unicode string to bytes. If the console claims it is using ASCII as the codec, then this fails just as hard in Python 3. — Martijn Pieters, Aug 06 '14 at 17:03
@bobince, while we are getting out of my field of expertise, I will say that all such occurrences of the `'ascii' codec` error I've encountered have ceased since my switch to Python 3. I could be mistaken for these reasons, however. — punyidea, Aug 06 '14 at 17:06
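The point made in the comments above, that Python 3 still has to encode text before it reaches the console, can be sketched with an in-memory stream standing in for the console. This is only an illustration: `print_to_fake_console` is a hypothetical helper, and a real terminal's encoding depends on the environment.

```python
import io


def print_to_fake_console(text, encoding, errors="strict"):
    """Print text to an in-memory byte stream that encodes like a console.

    Hypothetical helper for illustration: io.TextIOWrapper plays the role
    of sys.stdout, and the BytesIO underneath is the byte stream it feeds.
    """
    raw = io.BytesIO()
    console = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=console)
    console.flush()
    return raw.getvalue()


sample = "a\xa0b"  # contains U+00A0, as in the question

try:
    print_to_fake_console(sample, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # Python 3 fails here too when the stream is ASCII

utf8_output = print_to_fake_console(sample, "utf-8")
replaced_output = print_to_fake_console(sample, "ascii", errors="replace")
```

With an ASCII stream the print fails in Python 3 as well; a UTF-8 stream, or an error handler such as `replace`, avoids the exception.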

Tjorriemorrie · Accepted Answer · 2014-08-14T05:51:16.957


print soup.prettify().encode('utf8')

Although, if you only want to view the contents, dumping them from the response itself before passing them to BeautifulSoup works better:

res = requests.get('urlfoobar')
print res.content
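For readers on Python 3, a rough counterpart of the encode-before-output approach might look like the sketch below. It is not part of the original answer: `write_utf8` is a hypothetical helper, and the `markup` string is a stand-in for what `soup.prettify()` would return.

```python
import io
import sys


def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the bytes to a binary stream.

    Hypothetical helper: by default it targets sys.stdout.buffer, the
    binary layer under sys.stdout, so the console codec is bypassed.
    """
    if stream is None:
        stream = sys.stdout.buffer
    data = text.encode("utf-8")
    stream.write(data)
    stream.flush()
    return len(data)


# Stand-in for soup.prettify(); contains U+00E9 and U+00A0.
markup = "<p>caf\xe9\xa0bar</p>\n"

# Write to an in-memory sink here so the sketch has no console dependency.
sink = io.BytesIO()
written = write_utf8(markup, sink)
```

Calling `write_utf8(markup)` with no stream sends the bytes to standard output; whether they display correctly still depends on the terminal interpreting them as UTF-8.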

edited Aug 14 '14 at 05:51

answered Aug 06 '14 at 20:41

Tjorriemorrie

