I'm running into a similar problem right now. I'm trying to download images whose URLs I retrieve from the server in a JSON file. Some of the image URLs contain non-ASCII characters, and this throws an error:
for image in product["images"]:
    filename = os.path.basename(image)
    filepath = product_path + "/" + filename
    urllib.request.urlretrieve(image, filepath)  # error!
UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position ...
I've tried using .encode("UTF-8"), but can't say it helped:
# coding=UTF-8
import urllib.request
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")  # url is now bytes, not str
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
This just throws another error:
TypeError: cannot use a string pattern on a bytes-like object
Then I gave urllib.parse.quote(url) a go:
import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)  # quotes the whole URL, scheme included
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
Again, this throws another error:
ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'
The : in "http://..." also got escaped (to %3A), and I think this is the cause of the problem: without the colon, urllib no longer recognizes the URL scheme.
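To see the escaping in isolation, here is a small check that needs no network access. It only uses the example URL from above; the variable name quoted_default is mine.

```python
from urllib.parse import quote

url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"

# By default quote() treats only "/" as safe, so the ":" after the
# scheme is percent-encoded together with the non-ASCII character.
quoted_default = quote(url)

print(quoted_default)
```

The slashes survive, but the scheme separator does not, which matches the URL shown in the ValueError.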
So, I've figured out a workaround. I just quote/escape the path, not the whole URL.
import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
parts = urllib.parse.urlparse(url)
# Percent-encode only the path; the scheme and host stay untouched.
url = parts.scheme + "://" + parts.netloc + urllib.parse.quote(parts.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")
This is what the URL looks like: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and now I can download the image.
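If you need this for a whole list of URLs, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: quote_url_path is a name I made up, it keeps the query string and fragment as they are (the concatenation above drops them), it treats "%" as safe so an already-encoded path is not encoded twice, and it does nothing about non-ASCII host names.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path of `url`, leaving the other parts unchanged."""
    parts = urlsplit(url)
    # "%" is kept safe so that already-encoded paths are not encoded twice.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )


print(quote_url_path("http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"))
```

In the download loop you would then pass quote_url_path(image) to urllib.request.urlretrieve instead of image.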