Can't open Unicode URL with Python

Question

Using Python 2.5.2 on Debian Linux, I'm trying to fetch the content of a URL that contains the Spanish character 'í':

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()

I'm getting this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

Before passing the URL to urllib, I've tried this:

url = urllib.quote(url)

and this:

url = url.encode('UTF-8')

but they didn't work.

Can you tell me what I am doing wrong?

score 7 · Answer 1 · answered Dec 16 '09 at 18:41

This works for me:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The declaration above defines the source file encoding and must be on line 1 or 2;
# see: http://www.python.org/dev/peps/pep-0263/

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()

score 5 · Accepted Answer · answered Dec 16 '09 at 18:42

Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with `%ED`.
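
As a rough illustration of the percent-encoding described above (Python 3 syntax, which postdates the question), `urllib.parse.quote` can produce the escaped form. Note that `%ED` is the Latin-1 byte for 'í', while the same character encoded as UTF-8 becomes `%C3%AD`. Which of the two a given server expects is an assumption you would need to check.

```python
from urllib.parse import quote

path = "índice.html"

# UTF-8 is quote()'s default encoding: 'í' becomes two escaped bytes.
utf8_quoted = quote(path)

# Latin-1 encodes 'í' as the single byte 0xED.
latin1_quoted = quote(path, encoding="latin-1")

print(utf8_quoted)    # %C3%ADndice.html
print(latin1_quoted)  # %EDndice.html
```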

answered Dec 16 '09 at 18:42

Alex Martelli


I believe this has since been changed, and at least domains can now contain arbitrary Unicode characters. – Cerin May 19 '13 at 02:44
@Cerin Sort of. [IRIs can contain arbitrary Unicode characters](https://www.w3.org/International/articles/idn-and-iri), but when you convert them to regular URIs they're normalised to ASCII using 'Punycode' (for the domain component) and percent-encoding (for the path component). – Daisy Leigh Brenecki Aug 25 '16 at 05:36
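
A minimal sketch of the conversion described in the comment above, in Python 3. The helper name `iri_to_uri` and the host `bücher.example` are made up for illustration. Python's built-in `idna` codec implements the older IDNA 2003 rules, and this sketch ignores user info and leaves the query untouched, so treat it as an approximation rather than a complete IRI implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: ASCII-only form of a simple http(s) IRI."""
    parts = urlsplit(iri)
    # Domain component: Punycode via Python's built-in 'idna' codec.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    # Path component: percent-encode the UTF-8 bytes, keeping '/' and
    # any '%' escapes that are already present.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


uri = iri_to_uri("http://bücher.example/índice.html")
print(uri)  # http://xn--bcher-kva.example/%C3%ADndice.html
```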

score 4 · Answer 3 · answered Dec 16 '09 at 18:40

Encoding the URL as UTF-8 should have worked. I wonder whether your source file is properly encoded, and whether the interpreter knows it. If your Python source file is saved as UTF-8, for example, then you should have

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

works for me.

Edit: also, be aware that Unicode text in an interactive Python session (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escape sequences (like \u00ED in your case).
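
To illustrate the escape-sequence suggestion, here is a small sketch. Because `\u00ED` is written in plain ASCII, the result does not depend on the source-file or console encoding. It runs as written on Python 3 (3.3 or later for the `u` prefix), and the `example.com` URL is only a placeholder.

```python
# The escape is pure ASCII in the source, so no coding declaration is needed.
url = u"http://example.com/\u00EDndice.html"

# U+00ED is 'í'; encoded as UTF-8 it becomes the two bytes 0xC3 0xAD.
encoded = url.encode("utf-8")

print(repr(encoded))
```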

score 3 · Answer 4 · answered Dec 16 '09 at 18:43

It works for me. Make sure you're using a fairly recent version of Python, and your file encoding is correct. Here's my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()

(mydomain.es does not exist, so the DNS lookup fails, but there are no unicode issues to that point.)

answered Dec 16 '09 at 18:43

Eddie Sullivan


With python 3 I get `AttributeError: 'bytes' object has no attribute 'timeout'` when using this code. Is there a python 3 solution? – byxor Sep 27 '16 at 17:43
@BrandonIbbotson You should try `urllib.parse.quote(url)` instead of `url.encode('utf-8')`. You can read more about it here: https://docs.python.org/dev/library/urllib.parse.html#urllib.parse.quote – Snooze Feb 08 '17 at 22:44
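
A sketch of what that Python 3 suggestion might look like. One assumption worth flagging: `quote` treats only `/` as safe by default, so passing a whole URL would also escape the `:` after the scheme. The `safe=":/"` argument below is added to prevent that. No request is made here; only the string is built.

```python
from urllib.parse import quote

url = "http://mydomain.es/índice.html"

# Keep ':' and '/' literal so the scheme and path separators survive.
quoted = quote(url, safe=":/")

print(quoted)  # http://mydomain.es/%C3%ADndice.html
```

The resulting string is plain ASCII and could then be handed to `urllib.request.urlopen`.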

score 3 · Answer 5 · answered Apr 16 '19 at 07:57

I have a similar case right now. I'm trying to download images whose URLs I retrieve from the server in a JSON file. Some of the image URLs contain non-ASCII characters. This throws an error:

for image in product["images"]:
    filename = os.path.basename(image)
    filepath = product_path + "/" + filename
    urllib.request.urlretrieve(image, filepath)  # error!

UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position ...

I've tried using .encode("UTF-8"), but can't say it helped:

# coding=UTF-8
import urllib.request
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

This just throws another error:

TypeError: cannot use a string pattern on a bytes-like object

Then I gave urllib.parse.quote(url) a go:

import urllib.parse
import urllib.request
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

and again, this throws another error:

ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'

The : in "http://..." also got escaped, and I think this is the cause of the problem.

So, I've figured out a workaround. I just quote/escape the path, not the whole URL.

import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.urlparse(url)
url = url.scheme + "://" + url.netloc + urllib.parse.quote(url.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

This is what the URL looks like: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and now I can download the image.
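
The workaround above rebuilds the URL from the scheme, host, and path only, so a query string or fragment would be dropped. A possible variant, sketched under the assumption that those parts should be kept, uses `urlsplit` and `_replace`. The function name `quote_url_path` and the `?size=large` query are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Hypothetical helper: percent-encode only the path of a URL."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace swaps one field and
    # leaves scheme, netloc, query, and fragment untouched.
    return urlunsplit(parts._replace(path=quote(parts.path)))


url = quote_url_path(
    "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png?size=large"
)
print(url)
```

Because `quote` escapes `%` by default, a path that is already percent-encoded would be encoded a second time; passing `safe="/%"` is one way to avoid that if the input may be partly escaped.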
