How to search for exact Chinese characters in a site?

Question

I am trying to search for certain Chinese characters in a site, but the search always comes up as not found. Here is the code I have so far. The site is in Chinese.

from random import randint
import urllib.request

# Pick a random nine-digit shop ID
value = randint(100000000, 999999999)
# print(value)

# Build the shop link from its parts
shop = 'https://shop'
taobao = '.taobao.com'
tempLink = 'https://shop357612815.taobao.com/'  # fixed example shop, unused below
link = shop + str(value) + taobao

# Search word list
words = ['2017', '2018', '2019', 'tide brand', 'taobao', '.00', 'palace', 'ader error',
         'vlone', 'fog', 'fear of god', 'assc', 'anti', '4.', '5.', '首页']

# Searcher: download the page, decode it, and print every word found in it
site = urllib.request.urlopen(link).read().decode('utf-8', errors='ignore')
for word in words:
    if word in site:
        print(word, link)

If I remove the errors='ignore' argument, it stops working and raises this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 267: invalid start byte

What line is causing the error? It's often best to include a full traceback in questions you post here. — martineau, May 26 '20 at 22:16
You are telling Python that the input text is UTF-8 encoded. You seem pretty sure of that, but how do you know? — Jongware, May 26 '20 at 22:16
It sounds like the data you are reading from that site is not UTF-8 encoded. Are you sure you are only decoding the HTML content and not any headers or other HTTP metadata? — Code-Apprentice, May 26 '20 at 22:17
@Code-Apprentice I honestly don't know. This is my first time trying a project like this. What I can tell you is that the site is written entirely in Chinese characters. — NoobierNoob, May 26 '20 at 22:20
@martineau `File "c:/Users/-------/Desktop/TaoBao Attempt/taobaoShopRandomizer.py", line 22, in <module> site = urllib.request.urlopen(link).read().decode('utf-8') UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 267: invalid start byte` Hope this helps. — NoobierNoob, May 26 '20 at 22:22
The `https://shop357612815.taobao.com` webpage is encoded in GBK, not UTF-8. — martineau, May 26 '20 at 23:03
@martineau Changing it to GBK worked! Thank you! For future reference, how did you figure out what it was encoded in? — NoobierNoob, May 26 '20 at 23:09
I cheated a little and used the "Page Info" feature of my web browser. You could also determine it manually by examining the webpage's response header information. I'm no expert, but I strongly suspect there's an existing Python module that could tell you this. — martineau, May 26 '20 at 23:14
The answer to [How to identify character encoding from website?](https://stackoverflow.com/questions/15073937/how-to-identify-character-encoding-from-website) might help. — martineau, May 26 '20 at 23:36
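The suggestions in the comments can be sketched roughly as follows. This is only a minimal illustration, assuming the page declares its encoding either in the `Content-Type` response header or in a `<meta>` tag near the top of the HTML; whether the Taobao pages do so is not confirmed here. The helper names `sniff_charset`, `find_words`, and `fetch_and_find` are made up for this example, and the sample bytes stand in for a real download.

```python
import re
from urllib.request import urlopen


def sniff_charset(content_type, body, default="utf-8"):
    """Guess the encoding from the Content-Type header, then from a <meta> tag."""
    # 1) HTTP header, e.g. "text/html; charset=GBK"
    match = re.search(r"""charset=["']?([\w-]+)""", content_type or "", re.I)
    if match:
        return match.group(1).lower()
    # 2) <meta charset="gbk"> or the older http-equiv form, within the first 2 KB
    match = re.search(rb"""<meta[^>]*charset=["']?([\w-]+)""", body[:2048], re.I)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3) Nothing declared: fall back to an assumed default
    return default


def find_words(words, content_type, body):
    """Decode the raw bytes with the sniffed charset and return the words found."""
    text = body.decode(sniff_charset(content_type, body), errors="replace")
    return [word for word in words if word in text]


def fetch_and_find(link, words):
    """Download a page and search it (needs network access)."""
    with urlopen(link) as response:
        return find_words(words, response.headers.get("Content-Type"), response.read())


# Offline stand-in for a GBK-encoded page
raw = '<html><head><meta charset="gbk"></head><body>首页</body></html>'.encode("gbk")

# Decoding GBK bytes as UTF-8 and ignoring errors silently loses the Chinese text,
# which is why the search in the question never matched.
print("首页" in raw.decode("utf-8", errors="ignore"))  # False

# Decoding with the declared charset keeps it.
print(find_words(["首页", "taobao"], None, raw))  # ['首页']
```

With network access, `fetch_and_find(link, words)` would apply the same steps to a live page. A page that declares no charset, or declares the wrong one, would still need a manual check such as the browser's "Page Info" mentioned above.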

0 Answers