
I have a CSV file with some URLs. I read it in Python with:

import csv
rows = []
with open("links.csv", "r", encoding="utf-8") as c:
    csv_reader = csv.reader(c)
    for row in csv_reader:
        rows.append(row)

This returns a list of all the URLs the CSV file contains. Then I try to fetch each page and get an element via its XPath with "requests" and "lxml":

import requests
import lxml.html as html

img_links = []

def scraper():
    for row in rows:
        try:
            article = requests.get(row[0])
            if article.status_code == 200:
                artc = article.content.decode("utf-8")
                parsed = html.fromstring(artc)
                img_url = parsed.xpath(URL_1)
                img_links.append(img_url)
            else:
                raise ValueError(f"Error: {article.status_code!r}")
        except ValueError as ve:
            print(ve)

Now the problem is, when I run this program, the following error appears:

No connection adapters were found for '\ufeffhttps://detail.1688.com/offer/524898885299.html?spm=a26352.b28411319.offerlist.290.ad1b1e625GXj5E' 'utf-8' codec can't decode byte: invalid start byte

As a note: all these links are from Chinese web pages such as 1688 or taobao, which makes me think the problem has something to do with encoding. I've tried using "utf-8-sig"; that solves the '\ufeff' problem but does not solve the "can't decode byte" error.
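The '\ufeff' part of the error is the UTF-8 byte order mark from the CSV file ending up glued to the first URL. A minimal, self-contained sketch of the difference between "utf-8" and "utf-8-sig" (using a throwaway file in place of the real links.csv):

```python
import csv
import os
import tempfile

# Write a small CSV that starts with a UTF-8 BOM, like links.csv seems to.
path = os.path.join(tempfile.mkdtemp(), "links.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["https://detail.1688.com/offer/524898885299.html"])

# Reading with plain "utf-8" leaves the BOM attached to the first URL,
# which is why requests complains it has no connection adapter for it...
with open(path, encoding="utf-8") as f:
    first = next(csv.reader(f))[0]
print(repr(first))  # '\ufeffhttps://...'

# ...while "utf-8-sig" strips the BOM, so requests gets a clean URL.
with open(path, encoding="utf-8-sig") as f:
    clean = next(csv.reader(f))[0]
print(repr(clean))  # 'https://...'
```

This only addresses the BOM half of the error; the decode failure is a separate issue with the response body, not the CSV.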

  • Which line exactly raises the error? In other words, could you show the full stack trace? – Serge Ballesta Jun 14 '22 at 07:53
  • If the response cannot be decoded as UTF-8, then it's not UTF-8, which is not entirely surprising for Chinese websites. Have you checked what encoding they use and tried that instead? – deceze Jun 14 '22 at 07:53
  • There are two separate things happening here. (1) You're now correctly removing the UTF-8 BOM by using "utf-8-sig" - great. (2) In 'article.content.decode("utf-8")' you are making the assumption that the page uses UTF-8 encoding, which is not necessarily true. You may have been intercepted by some intermediate captcha page that is not using UTF-8, etc. You will need to inspect article.content and try to determine the character set, possibly using chardet or equivalent. – Rusticus Jun 14 '22 at 12:22
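A minimal sketch of the point made in the comments: Chinese pages are often served as GBK/GB2312 rather than UTF-8, so a hard-coded `.decode("utf-8")` fails even though the bytes themselves are fine (the GBK string here just stands in for `article.content`):

```python
# Bytes as a Chinese server might send them; stands in for article.content.
raw = "你好，世界".encode("gbk")

# The hard-coded decode from the question fails on these bytes...
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 fails:", e.reason)

# ...while decoding with the actual charset works. In the real scraper you
# could use article.encoding / article.apparent_encoding (requests), or skip
# decoding entirely and pass article.content (bytes) straight to
# lxml.html.fromstring, which honours the page's declared charset.
print(raw.decode("gbk"))  # 你好，世界
```

Passing the raw bytes to lxml is the simplest fix, since it avoids guessing the encoding at all.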

0 Answers