I'm a Python newbie and I'm having trouble with encoding and URLs. My goal is to download every URL listed in a text file. My script runs well, but I get errors for URLs that contain French accented characters (é, è, à, etc.).

Here is my code:

#!/usr/bin/env python
# coding: utf8

import urllib.request
import os
import codecs
import io

# Variables settings

URL = ""
finalFileName = ""
listFiles = "fichiers.txt"
nbLines = 0
currentLine = 1

# Open the file

print ("Open the source file...")

file = open(listFiles, "r")
lines = file.readlines()
# Get line numbers
for line in lines:
    nbLines += 1
file.close()

# Download the file

print ("Download the " + str(nbLines) + " files started")

# Read the file line per line
for line in lines :

    URL = line.replace("\n", "")
    finalFileName= os.path.basename(URL)
    print ("Download " + finalFileName + " [" + str(currentLine) + "/" + str(nbLines) + "]")
    # Download the file
    urllib.request.urlretrieve (URL,finalFileName)
    # Increment the counter
    currentLine += 1

print ("Done")

I then get this error:

Download racers-saturewood-300x225.jpg [15/993]
Download _81______r-s-oil-top-finish_363.jpg [16/993]
Download traitement_thermo_traite.jpg [17/993]
Download Blanchiment-du-Douglas-exposé-NORD-150x150.jpg [18/993]
Traceback (most recent call last):
  File "D:\Bureau\images-site\dlimage.py", line 39, in <module>
    urllib.request.urlretrieve (URL,finalFileName)
  File "C:\Python34\lib\urllib\request.py", line 186, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 65-66: ordinal not in range(128)

I've tried a couple of things to avoid the error:

URL.encode('utf8'): the characters are not converted and I still get UnicodeEncodeError: 'ascii' codec can't encode characters in position 65-66: ordinal not in range(128)
URL.decode(): does not work

I'm lost and I don't know how to solve this problem. Can you help me, please?

Thanks and greetings, Arthur

Arthur C-G
  • It's a bit hard to read your half-French source, but if I'm understanding you correctly, you need to run the URL you're trying to fetch through urllib.quote() (urllib.parse.quote in Python 3) to get the URL-encoded form of the string. – miq Sep 08 '15 at 12:03
  • Sorry about that, I've translated it into English now... :/ – Arthur C-G Sep 08 '15 at 12:10
  • Thanks for your answer! I just want to download the files. It works until the script reaches a URL with non-ASCII characters. I tried passing the URL with the right encoding (like UTF-8), but it didn't work... – Arthur C-G Sep 08 '15 at 12:12
  • Like I said, try `URL = urllib.parse.quote(URL)` before call to urlretrieve. – miq Sep 08 '15 at 12:16
  • Sorry ^^ Now I get this error: ValueError: unknown url type: 'http%3A//www.website.com/img/blog.jpg' – Arthur C-G Sep 08 '15 at 12:18
  • Could my source file be the problem here ? – Arthur C-G Sep 08 '15 at 12:19
  • Nope, that was my bad. Obviously you shouldn't use quote on the full URL. Check this [thread](https://stackoverflow.com/questions/120951/how-can-i-normalize-a-url-in-python) for possible solutions. – miq Sep 08 '15 at 12:23
  • Hmm... I tried this: `import urllib.parse [...] URLquote = quote(URL, safe="éàè")` and I got: NameError: name 'quote' is not defined. Do you have an idea? – Arthur C-G Sep 08 '15 at 12:34
  • You shouldn't use those characters as safe because they are the ones you want to encode. Use the characters defined in the answer you copied it from and change `quote` to `urllib.parse.quote`. – miq Sep 08 '15 at 12:40
  • The command works. Do you mean: `URLquote = urllib.parse.quote(URL, safe="é")`? – Arthur C-G Sep 08 '15 at 12:50
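
Following the direction of the comments above, one possible approach is to percent-encode only the path and query of each URL instead of the whole string, so that the `http://` prefix stays intact. This is a minimal sketch rather than a confirmed fix: the helper name `percent_encode_url` is hypothetical, the safe-character sets are an assumption, the example URL is made up, and non-ASCII host names are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper for illustration. Characters are encoded as UTF-8,
    which is the default of urllib.parse.quote.
    """
    parts = urlsplit(url)
    # Keep "/" so the path structure survives, and "%" so that sequences
    # that are already percent-encoded are not encoded a second time.
    path = quote(parts.path, safe="/%")
    # Assumed safe set for the query: keep the usual key=value&key=value separators.
    query = quote(parts.query, safe="=&%")
    # Scheme, host and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    # Made-up URL; only the accented file name pattern comes from the log above.
    print(percent_encode_url("http://www.website.com/img/exposé.jpg"))
```

In the script, this would mean calling `urllib.request.urlretrieve(percent_encode_url(URL), finalFileName)`, leaving `finalFileName` derived from the original URL so the saved file keeps its accented name. Keeping `%` in the safe set is a deliberate choice so that entries in the list that are already encoded are not encoded twice; the trade-off is that a literal `%` in a file name is left as-is.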

0 Answers