Python web scraping: problems parsing Chinese characters with Beautiful Soup/requests

Question

I am scraping a Chinese website. Usually I have no problem parsing the Chinese characters, which I use to find specific URLs by pattern matching in bs4. For this particular website, however, the soup is not parsed properly. Below is the code I use to set up the soup:

import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
soup = bs(r.content, "html.parser")

An example of the printed soup is the following:

[Image: current soup]

Note: I had to add a picture because Stack Overflow thought the pasted text was spam :)

The output should have looked like this:

[Image: proper soup]

I wonder whether I have to specify some kind of encoding in the request, or perhaps in the soup, but so far I have not found anything that works.

Thanks in advance!

score 1 · Accepted Answer · answered Dec 18 '20 at 09:54

I don't know Chinese. Does this give the desired results?

import requests
from bs4 import BeautifulSoup as bs

start = f'http://www.shuichan.cc/news_list.asp?action=&c_id=93&s_id=210&page={1}'
r = requests.get(start)
# Decode the raw bytes as GBK (dropping any undecodable bytes) before parsing
soup = bs(r.content.decode('GBK', 'ignore'), "html.parser")

print(soup)
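
To illustrate why the explicit decode helps, here is a minimal offline sketch. It assumes the server sends GBK-encoded bytes; the sample string is made up for the demonstration and is not taken from the site. Decoding the same bytes with the wrong codec gives garbage without raising an error, while decoding with GBK recovers the text.

```python
# Illustrative only: this sample string stands in for the page's real content.
sample = "水产新闻"  # hypothetical text, roughly "aquaculture news"

# Bytes as a GBK-encoded server would send them (two bytes per character here).
raw = sample.encode("gbk")

# Decoding with the wrong codec yields mojibake rather than an exception.
garbled = raw.decode("latin-1")

# Decoding with the codec the bytes were written in recovers the text.
# "ignore" silently drops any bytes that are not valid GBK.
fixed = raw.decode("gbk", "ignore")

print(repr(garbled))
print(fixed)
```

Note that `'ignore'` hides decoding problems by discarding bytes; passing `'replace'` instead would leave a visible marker wherever something could not be decoded.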

chitown88


Yes it does - Thank you for your help! – Spedtsberg Dec 18 '20 at 10:13
Can you elaborate on what the fix does? – Spedtsberg Dec 18 '20 at 10:14
It's just the character set the site uses; the raw bytes need to be decoded with it. You can read more [here](https://stackoverflow.com/questions/53954604/python-encoding-chinese-to-special-character) – chitown88 Dec 18 '20 at 10:18
This is exactly what I needed for scraping a Chinese language website. Thanks. – Life is complex Oct 11 '21 at 14:55
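
For readers wondering how to find out which character set a page uses, one rough approach is to look for the charset declared in the page's own `<meta>` tag. The sketch below is an assumption-laden illustration, not part of the answer: `sniff_charset` is a hypothetical helper and the page bytes are invented. Pages that declare `gb2312` sometimes contain characters from its superset GBK, which may be why decoding with `'GBK'` is the more forgiving choice.

```python
import re


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in an HTML <meta> tag, or `default`.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    # Only inspect the start of the document, where the declaration usually sits.
    head = raw[:2048]
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", head, re.IGNORECASE)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Invented page bytes; with requests, the real bytes would be r.content.
page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gb2312"></head>'
    "<body>水产</body></html>"
).encode("gbk")

charset = sniff_charset(page)
text = page.decode(charset, "replace")  # "replace" marks undecodable bytes
print(charset)
print(text)
```

With requests itself, assigning to `r.encoding` before reading `r.text` is another way to tell the library which codec to use.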
