I'm a Python beginner. I wrote the following code:

from bs4 import BeautifulSoup
import requests

url = "http://www.google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
for link in links:
    print(link.text)

When I run this .py file in Windows PowerShell, print(link.text) raises the following error:

UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 5: illegal multibyte sequence

I know the error is caused by some Chinese characters, and it seems like I should use 'decode' or 'ignore', but I don't know how to fix my code. Help please! Thanks!

Wasi Ahmad
Jack

2 Answers

0

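If you don't want to display those special characters, you can drop them by encoding with the console's codec ('gbk', according to the error message) using errors="ignore", then decoding back to a string:

print(link.text.encode("gbk", errors="ignore").decode("gbk"))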
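If the console codec is not always 'gbk', the same idea can be wrapped in a small helper. This is only a sketch: `console_safe` is a hypothetical name, not part of any library, and it assumes `sys.stdout.encoding` reflects the codec that `print` will use.

```python
import sys

def console_safe(text, encoding=None, errors="ignore"):
    """Return text with characters the target codec cannot encode dropped.

    Hypothetical helper for illustration. Pass errors="replace" to get '?'
    in place of each unencodable character instead of dropping it.
    """
    # Fall back to the codec Python reports for stdout, then to UTF-8.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)

# '\xbb' (the character from the error message) has no GBK encoding.
print(console_safe("More \xbb", "gbk"))  # prints "More "
```

In the loop from the question this would be used as `print(console_safe(link.text))`.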
Anurag
0

You can encode the string as UTF-8.

for link in links:
    print(link.text.encode('utf8'))

But a better approach is:

response = requests.get(url)
soup = BeautifulSoup(response.text.encode("utf8"), "html.parser")

To understand more about the problem you are facing, you should look at this Stack Overflow answer.
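As a rough illustration of where the error comes from: `print` encodes text with the output stream's codec, and the stream's error handler decides what happens to characters that codec cannot represent. The sketch below simulates a GBK console with an in-memory buffer. The name `tolerant_writer` is made up for this example, and the buffer stands in for the real console.

```python
import io

def tolerant_writer(raw, encoding, errors="replace"):
    """Wrap a binary stream so unencodable characters become '?' instead of raising."""
    return io.TextIOWrapper(raw, encoding=encoding, errors=errors)

# Stand-in for a GBK console.
buffer = io.BytesIO()
out = tolerant_writer(buffer, "gbk")

out.write("More \xbb")  # '\xbb' cannot be encoded in GBK, so it is replaced
out.flush()
result = buffer.getvalue()  # b"More ?"
```

On Python 3.7 and later, the real console stream can be given the same behaviour with `sys.stdout.reconfigure(errors="replace")` before printing. Setting the `PYTHONIOENCODING` environment variable (for example to `utf-8`) is another option, assuming the terminal can display that encoding.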

Community
Wasi Ahmad