I am trying to search a website for certain Chinese characters, but the search never finds them, even though the site is written in Chinese. Here is my code so far:

from random import randint
for _ in range(1):
    value = randint(100000000, 999999999)
    # print(value)

#Gets link + puts together
shop = 'https://shop'
taobao = '.taobao.com'
tempLink = 'https://shop357612815.taobao.com/'
link = shop + str(value) + taobao

#request stuff
from urllib.request import urlopen
import urllib.request

#search word list
words = ['2017', '2018', '2019', 'tide brand', 'taobao', '.00', 'palace', 'ader error',
         'vlone', 'fog', 'fear of god', 'assc', 'anti', '4.', '5.', '首页']

#searcher
site = urllib.request.urlopen(link).read().decode('utf-8', errors = 'ignore')
for word in words:
    if word in site:
        print(word, link)

If I remove the errors = 'ignore' argument, the script stops working and raises this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 267: invalid start byte
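
The behaviour described above can be reproduced offline. This is only a minimal sketch, and it assumes the page bytes are GBK-encoded (as a comment further down reports); the variable names are illustrative. It shows that strict UTF-8 decoding fails on such bytes, and that errors = 'ignore' silently drops the undecodable bytes, so the Chinese search word can no longer be found.

```python
# Assumption: the server sends GBK bytes (reported later in the comments).
text = '首页'
raw = text.encode('gbk')  # stand-in for the bytes returned by .read()

# Strict UTF-8 decoding rejects these bytes.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='ignore' discards whatever cannot be decoded, so the
# original characters are lost rather than recovered.
lossy = raw.decode('utf-8', errors='ignore')

print(strict_ok)                  # False
print(text in lossy)              # False
print(text in raw.decode('gbk'))  # True
```
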
  • What line is causing the error? It's often best to include a full traceback in questions you post here. – martineau May 26 '20 at 22:16
  • You are telling Python that the input text is UTF-8 encoded. You seem pretty sure of that, but how do you know? – Jongware May 26 '20 at 22:16
  • It sounds like the data you are reading from that site is not UTF-8 encoded. Are you sure you are only decoding the HTML content and not any headers or other HTTP metadata? – Code-Apprentice May 26 '20 at 22:17
  • @Code-Apprentice I honestly don't know. This is my first time trying a project like this. What I can tell you is that the site is written in full Chinese characters. – NoobierNoob May 26 '20 at 22:20
  • @martineau `File "c:/Users/-------/Desktop/TaoBao Attempt/taobaoShopRandomizer.py", line 22, in <module> site = urllib.request.urlopen(link).read().decode('utf-8') UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 267: invalid start byte` Hope this helps – NoobierNoob May 26 '20 at 22:22
  • Please edit your original question with that info. – Code-Apprentice May 26 '20 at 22:24
  • The `https://shop357612815.taobao.com` webpage is encoded in GBK, not UTF-8. – martineau May 26 '20 at 23:03
  • @martineau Changing it to GBK worked! Thank you! For future reference, how did you figure out which encoding the page uses? – NoobierNoob May 26 '20 at 23:09
  • I cheated a little and used a "Page Info" feature my web browser has. You could also determine it manually by examining the webpage's response header information. I'm no expert, but I strongly suspect there's an existing Python module that could tell you this. – martineau May 26 '20 at 23:14
  • The answer to [How to identify character encoding from website?](https://stackoverflow.com/questions/15073937/how-to-identify-character-encoding-from-website) might help. – martineau May 26 '20 at 23:36
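
The suggestion in the comments (read the encoding from the response headers instead of assuming UTF-8) could be sketched roughly as follows. This is not a definitive implementation: `detect_charset` and `fetch_text` are hypothetical helper names, the `<meta>` regex is a simplification, and the `'utf-8'` fallback is an assumption. `response.headers.get_content_charset()` is the standard-library call that returns the charset from the `Content-Type` header, or `None` when the header does not declare one.

```python
import re
from urllib.request import urlopen

# Simplified pattern for <meta charset="..."> or
# <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def detect_charset(header_charset, body, default='utf-8'):
    """Pick an encoding: HTTP header first, then a <meta> tag, then a default."""
    if header_charset:
        return header_charset
    match = META_CHARSET.search(body[:4096])  # the tag normally sits near the top
    if match:
        return match.group(1).decode('ascii')
    return default


def fetch_text(url):
    """Download a page and decode it with the detected encoding."""
    with urlopen(url) as response:
        body = response.read()
        header_charset = response.headers.get_content_charset()
    # errors='replace' keeps undecodable bytes visible instead of dropping them.
    return body.decode(detect_charset(header_charset, body), errors='replace')
```

With helpers like these, the search loop from the question could call `fetch_text(link)` instead of hard-coding `'utf-8'`.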
