How to edit the code to read non-English characters from a CSV file

Question

This code downloads images from the links in a CSV column called "link" and saves each one under the name taken from another column ("title" in the code below). It stops working when it encounters a non-English character. I want it to work with non-English characters as well.

Here is the code:

import urllib.request
import csv
import os

with open('booklogo.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    
    for row in reader:
        print(row)
        if row["link"] != '' and row["title"] != '':
            name, ext = os.path.splitext(row['link'])
            if ext == '':
                ext = ".png"
            title_filename = f"{row['title']}{ext}".replace('/', '-')
            urllib.request.urlretrieve(row['link'], title_filename)

Here is the error:

OSError
Input In [5], in <cell line: 5>()
     13                 ext = ".png"
     14             title_filename = f"{row['title']}{ext}".replace('/', '-')
---> 15             urllib.request.urlretrieve(row['link'], title_filename)

File ~\anaconda3\lib\urllib\request.py:249, in urlretrieve(url, filename, reporthook, data)
    247 # Handle temporary file setup.
    248 if filename:
--> 249     tfp = open(filename, 'wb')
    250 else:
    251     tfp = tempfile.NamedTemporaryFile(delete=False)

OSError: [Errno 22] Invalid argument: 'Albert ?eská republika.png'

Be careful: a CSV file is not an Excel file, although Excel can open it. CSV is a text file format. Do you know the file's encoding? I'd advise looking into that. – ErnestBidouille, Sep 26 '22 at 13:07
To deal with these characters you have to look carefully at character encoding. Which encoding is your file in? Which encoding is your script expecting? A mismatch between the two might explain your issue. – Matthias, Sep 26 '22 at 13:08
@D.L Here is the error: OSError Input In [5], in () 13 ext = ".png" 14 title_filename = f"{row['title']}{ext}".replace('/', '-') ---> 15 urllib.request.urlretrieve(row['link'], title_filename) File ~\anaconda3\lib\urllib\request.py:249, in urlretrieve(url, filename, reporthook, data) 247 # Handle temporary file setup. 248 if filename: --> 249 tfp = open(filename, 'wb') 250 else: 251 tfp = tempfile.NamedTemporaryFile(delete=False) OSError: [Errno 22] Invalid argument: 'Albert ?eská republika.png' – midomid, Sep 26 '22 at 14:20
We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee, Sep 26 '22 at 17:05
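
The hex dump requested in the last comment can be produced with a few lines of Python. This is only a minimal sketch: the helper name dump_non_ascii and its context parameter are made up for illustration, and the sample bytes are an in-memory stand-in for the real file.

```python
def dump_non_ascii(data, context=8):
    """Return one hex-dump line per run of non-ASCII bytes, with some context."""
    lines = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Plain ASCII byte, nothing to report.
            i += 1
            continue
        # Extend j over the whole run of non-ASCII bytes.
        j = i
        while j < len(data) and data[j] >= 0x80:
            j += 1
        start = max(0, i - context)
        end = min(len(data), j + context)
        chunk = data[start:end]
        hex_part = ' '.join(f'{b:02x}' for b in chunk)
        lines.append(f'offset {start}: {hex_part}  {chunk!r}')
        i = j
    return lines


# In-memory sample (an assumption standing in for the real file contents).
sample = 'Albert Česká republika'.encode('utf-8')
for line in dump_non_ascii(sample, context=2):
    print(line)

# With the CSV from the question, read the file in binary mode instead:
# with open('booklogo.csv', 'rb') as f:
#     for line in dump_non_ascii(f.read()):
#         print(line)
```

In UTF-8, Č is stored as the two bytes c4 8c. A single byte above 7f in that position would suggest a legacy single-byte encoding, and a 3f byte there would mean the file really contains a literal question mark.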

Zach Young · Accepted Answer · 2022-09-28T18:30:57.100

I think you're correct (in your comment below) that it's probably the question mark. Windows does not allow ? in file names, so open() rejects the name with Errno 22.

You need to sanitize your filename. Python's standard library has no function for this, so we'll draw on the most popular answer to the same question, "Turn a string into a valid filename?".

You'll need to add this function to your file:

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

Then modify your existing code, like:

...
# Sanitize filename.  Will get rid of periods too, so add ext after
title_filename = slugify(row['title'])
title_filename += ext
...
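
Since the comments ask for the whole code, here is a sketch of how the pieces might fit together. The helper names make_filename and download_images are illustrative, the encoding='utf-8' default is an assumption about how the CSV was saved, and the example.com links are placeholders. Note that with allow_unicode=False the accented letters are reduced to ASCII; pass allow_unicode=True to keep them.

```python
import csv
import os
import re
import unicodedata
import urllib.request


def slugify(value, allow_unicode=False):
    """Same logic as the slugify() shown above (taken from Django)."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')


def make_filename(title, link, allow_unicode=False):
    """Hypothetical helper: safe file name from a title plus the link's extension."""
    _, ext = os.path.splitext(link)
    if ext == '':
        ext = '.png'
    # slugify() strips periods, so the extension is appended afterwards.
    return slugify(title, allow_unicode=allow_unicode) + ext


def download_images(csv_path, encoding='utf-8', allow_unicode=False):
    """Hypothetical wrapper around the question's loop.

    The encoding default is an assumption; use whatever the file was saved in.
    """
    with open(csv_path, newline='', encoding=encoding) as csvfile:
        for row in csv.DictReader(csvfile):
            if row['link'] != '' and row['title'] != '':
                filename = make_filename(row['title'], row['link'], allow_unicode)
                urllib.request.urlretrieve(row['link'], filename)


# File-name handling only; nothing is downloaded here.
print(make_filename('Albert ?eská republika', 'https://example.com/logo.jpg'))
print(make_filename('Albert ?eská republika', 'https://example.com/logo',
                    allow_unicode=True))

# To process the CSV from the question:
# download_images('booklogo.csv')
```

The two print calls produce albert-eska-republika.jpg and albert-eská-republika.png: the question mark is dropped in both cases, and the second link falls back to .png because it has no extension.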

edited Sep 28 '22 at 18:30

answered Sep 26 '22 at 17:03

Could you please post the whole code, just to avoid indentation problems? – midomid Sep 28 '22 at 14:54
@midomid Why is indentation a problem? – Zach Young Sep 28 '22 at 17:19
I tried the first solution you provided, but the error keeps appearing. With the second solution I'm stuck: where exactly should I add your code? – midomid Sep 28 '22 at 17:23
Here is the error I get when adding the second code: Input In [8] except Exception as e: ^ SyntaxError: invalid syntax – midomid Sep 28 '22 at 17:47
I think the problem is caused by the non-English characters or the "?", because there is no space in the name in my CSV file: "Albert ?eská republika" – midomid Sep 28 '22 at 18:07
Ah, that's it, the question mark, great! One moment... – Zach Young Sep 28 '22 at 18:10
@midomid, rewrote answer to try and address the question mark. – Zach Young Sep 28 '22 at 18:24
