weird character printed when web scraping

Question

I tried writing some code to find and print the price of a specific book but when I ran the code it returned Â£54.23.

What is Â? How do I make it go away?

From my understanding I'm supposed to copy the CSS path for soup.select, but since that option did not show up in Chrome I used "Copy selector" instead. Could this be responsible for the Â?

Here's my Python code:

import requests
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
headers = {'User-Agent': user_agent}
res_obj = requests.get('http://books.toscrape.com/', headers=headers)
res_obj.raise_for_status()
soup = BeautifulSoup(res_obj.text, 'html.parser')
sapiens_price = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(5) > article > div.product_price > p.price_color')
print(sapiens_price[0].text)

That is the link for the screenshot of selector and other options I could copy. For some reason I can't post the link as an attachment. — CountRavioli, Jan 09 '22 at 21:03
You're likely decoding the document with the wrong text encoding. Compare the encoding the document declares in the `content-type` http response header to whatever encoding you are using to decode. — nlta, Jan 09 '22 at 21:12
Similar issue here: https://stackoverflow.com/questions/1461907/html-encoding-issues-%C3%82-character-showing-up-instead-of-nbsp — Tluther, Jan 09 '22 at 21:14
The content-type is text/html but the content-encoding is gzip. Since my program uses Python 3.10.0 it seems the webpage must be decoded using utf-8. — CountRavioli, Jan 09 '22 at 22:06

score 0 · Answer 1 · answered Jan 09 '22 at 23:06

Try this. Note that it removes every non-ASCII character, so the £ sign is dropped along with the Â:

soup = BeautifulSoup(res_obj.text, 'html.parser')

sapiens_price = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(5) > article > div.product_price > p.price_color')

print(sapiens_price[0].text.encode('ascii', 'ignore').decode())
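
To see what that last line does to the string, here is a small offline sketch. It assumes the garbled text came from UTF-8 bytes that were decoded as ISO-8859-1; the variable names are only for illustration. It compares stripping non-ASCII characters with undoing the wrong decoding:

```python
# The string the question's code printed, written with escapes: "Â£54.23"
garbled = "\u00c2\u00a354.23"

# Approach from this answer: drop every character that is not ASCII.
stripped = garbled.encode("ascii", "ignore").decode()

# Alternative: reverse the wrong decoding (assumes UTF-8 read as ISO-8859-1).
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(stripped)  # 54.23
print(repaired)  # £54.23
```

Stripping is fine if only the number matters, but the currency symbol is lost.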

Upendra · Accepted Answer · 2022-01-10T03:55:33.963

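The reason is that res_obj.text is decoded with the wrong encoding. The page is UTF-8, where £ is stored as the two bytes 0xC2 0xA3. Read as ISO-8859-1, those two bytes become the two characters Â and £. The CSS selector has nothing to do with it.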

See the Requests documentation, and notice this:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text
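
A rough, stdlib-only sketch of that guess, as I understand it: a charset in the Content-Type header wins, and a text/* type with no charset falls back to ISO-8859-1. The name guess_encoding is made up for illustration. It is not a Requests function, and the real logic covers more cases.

```python
from email.message import Message


def guess_encoding(content_type):
    """Simplified imitation of guessing a text encoding from Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter always wins.
    charset = msg.get_param("charset")
    if charset:
        return charset

    # text/* without a charset: fall back to the historical HTTP default.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # Anything else: no guess.
    return None


print(guess_encoding("text/html; charset=utf-8"))  # utf-8
print(guess_encoding("text/html"))                 # ISO-8859-1
```

The second call matches the situation in the question, where the header is just text/html.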

In your case, if you run your code in IDLE (or any interactive Python session), this is what you get when checking the encoding:

>>> res_obj.encoding
'ISO-8859-1'
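
That encoding is where the Â comes from. A minimal offline demonstration, assuming the server sends the price as UTF-8 bytes (variable names are illustrative):

```python
# The price as UTF-8 bytes, roughly what arrives over the wire.
raw = "\u00a354.23".encode("utf-8")  # "£54.23"

# Decoding with the guessed encoding turns the 2-byte £ into two characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the real encoding gives the expected text.
as_utf8 = raw.decode("utf-8")

print(raw)        # b'\xc2\xa354.23'
print(as_latin1)  # Â£54.23
print(as_utf8)    # £54.23
```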

Again from the documentation:

If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text

To override the guessed encoding, simply set a new one. In your case, it will be UTF-8:

>>> res_obj.encoding = 'UTF-8'

Do this before accessing res_obj.text and your code will work correctly:

res_obj = requests.get('http://books.toscrape.com/')
# Set the encoding manually, before res_obj.text is read
res_obj.encoding = 'utf-8'
soup = BeautifulSoup(res_obj.text, 'html.parser')
sapiens_price = soup.select('...')
print(sapiens_price[0].text)

TL;DR: set res_obj.encoding = 'utf-8' before accessing res_obj.text.
