This code is meant to download images from the URLs in a column called "link" in a CSV file and save each image under the name given in another column called "title". It stops working when it encounters a non-English character. I want the code to work with non-English characters as well.

Here is the code:

import urllib.request
import csv
import os

with open('booklogo.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    
    for row in reader:
        print(row)
        if row["link"] != '' and row["title"] != '':
            name, ext = os.path.splitext(row['link'])
            if ext == '':
                ext = ".png"
            title_filename = f"{row['title']}{ext}".replace('/', '-')
            urllib.request.urlretrieve(row['link'], title_filename)

Here is the error:


OSError
Input In [5], in <cell line: 5>()
     13     ext = ".png"
     14 title_filename = f"{row['title']}{ext}".replace('/', '-')
---> 15 urllib.request.urlretrieve(row['link'], title_filename)

File ~\anaconda3\lib\urllib\request.py:249, in urlretrieve(url, filename, reporthook, data)
    247 # Handle temporary file setup.
    248 if filename:
--> 249     tfp = open(filename, 'wb')
    250 else:
    251     tfp = tempfile.NamedTemporaryFile(delete=False)

OSError: [Errno 22] Invalid argument: 'Albert ?eská republika.png

midomid
  • Be careful: a CSV file is not an Excel file, although Excel can open it. CSV is a text file format. Do you know the encoding of the file? I advise you to look into that. – ErnestBidouille Sep 26 '22 at 13:07
  • To deal with these characters, you have to look carefully at the character encoding. Which encoding is your file in? Which encoding is your script expecting? A mismatch between the two might explain your issue. – Matthias Sep 26 '22 at 13:08
  • Which character does it fail on? Please post the error. – D.L Sep 26 '22 at 13:16
  • Characters like these: Ü, ä – midomid Sep 26 '22 at 14:19
  • @D.L here is the error ' OSError Input In [5], in () 13 ext = ".png" 14 title_filename = f"{row['title']}{ext}".replace('/', '-') ---> 15 urllib.request.urlretrieve(row['link'], title_filename) File ~\anaconda3\lib\urllib\request.py:249, in urlretrieve(url, filename, reporthook, data) 247 # Handle temporary file setup. 248 if filename: --> 249 tfp = open(filename, 'wb') 250 else: 251 tfp = tempfile.NamedTemporaryFile(delete=False) OSError: [Errno 22] Invalid argument: 'Albert ?eská republika.png ' – midomid Sep 26 '22 at 14:20
  • Please edit your question and put the error there. – Zach Young Sep 26 '22 at 15:36
  • @ZachYoung done – midomid Sep 26 '22 at 15:46
  • We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Sep 26 '22 at 17:05
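
The encoding advice in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the sample row and the temporary file stand in for the real booklogo.csv, and UTF-8 is assumed as the encoding, which the question does not confirm. The point is that passing an explicit encoding to open() removes the dependence on the platform default.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the real booklogo.csv.
rows = [{"title": "Albert Česká republika", "link": "https://example.com/logo.png"}]

path = os.path.join(tempfile.mkdtemp(), "booklogo.csv")

# Write the sample as UTF-8 so the non-ASCII characters are stored losslessly.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back with an explicit encoding. "utf-8-sig" also skips a leading
# byte-order mark if one is present, and behaves like UTF-8 otherwise.
with open(path, encoding="utf-8-sig", newline="") as f:
    titles = [row["title"] for row in csv.DictReader(f)]

print(titles)
```

If the real file was saved in a different encoding (for example a Windows code page), that name would have to replace "utf-8-sig", and characters already stored as a literal ? cannot be recovered by any choice of encoding.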

1 Answer

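I think you're correct (in your comment below) that it's probably the question mark. Windows does not allow ? in file names, which is why open() raises OSError: [Errno 22] Invalid argument.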

You need to sanitize your filename. Python's standard library has no function for this, so we'll borrow from the most popular answer to the same question, "Turn a string into a valid filename?".

You'll need to add this function to your file:

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

Then modify your existing code like this:

...
# Sanitize filename.  Will get rid of periods too, so add ext after
title_filename = slugify(row['title'])
title_filename += ext
...
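
Note that slugify lowercases the title and, unless allow_unicode=True is passed, reduces it to ASCII. If the goal is to keep characters such as Ü or ä in the file name, a narrower approach is to replace only the characters Windows rejects. The sketch below is an alternative to slugify, not part of it; safe_filename and its parameters are hypothetical names introduced for illustration, and it does not handle reserved device names such as CON or NUL.

```python
import re
import unicodedata

# Characters Windows rejects in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, ext=".png", replacement="-"):
    """Hypothetical helper: keep Unicode letters, replace only forbidden characters."""
    # Normalize to composed form so visually identical titles give the same name.
    name = unicodedata.normalize("NFC", str(title))
    name = _FORBIDDEN.sub(replacement, name)
    # Windows also rejects names that end in a space or a period.
    name = name.strip().rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return (name or "untitled") + ext


print(safe_filename("Albert ?eská republika"))
print(safe_filename("Müller/Söhne"))
```

In the download loop, this would take the place of the f-string and .replace('/', '-') call, for example title_filename = safe_filename(row['title'], ext).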
Zach Young