55

I need to fetch data from a URL with non-ASCII characters, but urllib2.urlopen refuses to open the resource and raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u0131' in position 26: ordinal not in range(128)

I know the URL is not standards-compliant, but I have no way to change it.

What is the way to access a resource pointed to by a URL containing non-ASCII characters using Python?

edit: In other words, can urlopen open a URL like the following, and if so, how?

http://example.org/Ñöñ-ÅŞÇİİ/
onurmatik (edited by martineau)

10 Answers

57

Strictly speaking URIs can't contain non-ASCII characters; what you have there is an IRI.

To convert an IRI to a plain ASCII URI:

  • non-ASCII characters in the hostname part of the address have to be encoded using the Punycode-based IDNA algorithm;

  • non-ASCII characters in the path and most other parts of the address have to be encoded using UTF-8 and %-encoding, as per Ignacio's answer.

So:

import re, urlparse

def urlEncodeNonAscii(b):
    # Percent-encode every non-ASCII byte of a UTF-8-encoded byte string.
    return re.sub('[\x80-\xFF]', lambda c: '%%%02x' % ord(c.group(0)), b)

def iriToUri(iri):
    parts = urlparse.urlparse(iri)
    return urlparse.urlunparse(
        # Component 1 is the netloc (hostname), which gets IDNA-encoded;
        # every other component gets UTF-8 plus percent-encoding.
        part.encode('idna') if parti == 1 else urlEncodeNonAscii(part.encode('utf-8'))
        for parti, part in enumerate(parts)
    )

>>> iriToUri(u'http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'

(Technically this still isn't quite good enough in the general case because urlparse doesn't split away any user:pass@ prefix or :port suffix on the hostname. Only the hostname part should be IDNA encoded. It's easier to encode using normal urllib.quote and .encode('idna') at the time you're constructing a URL than to have to pull an IRI apart.)
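For Python 3 (asked about in the comments below), here is a rough adaptation of the same approach using urllib.parse. It is only a sketch, not bobince's original code, and it inherits the same caveat that the whole netloc rather than just the hostname gets IDNA-encoded:

import re
import urllib.parse

def url_encode_non_ascii(b):
    # Percent-encode every non-ASCII byte of a UTF-8-encoded component.
    return re.sub(b'[\x80-\xff]', lambda m: b'%%%02x' % m.group(0)[0], b).decode('ascii')

def iri_to_uri(iri):
    parts = urllib.parse.urlparse(iri)
    return urllib.parse.urlunparse(
        # Component 1 is the netloc: IDNA-encode it; percent-encode the rest.
        part.encode('idna').decode('ascii') if index == 1
        else url_encode_non_ascii(part.encode('utf-8'))
        for index, part in enumerate(parts)
    )

>>> iri_to_uri('http://www.a\u0131b.com/a\u0131b')
'http://www.xn--ab-hpa.com/a%c4%b1b'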

bobince
  • Although this seems to be a very niche problem, it's sure resolved a very specific issue of my own. Great answer. – Llanilek Nov 03 '13 at 01:10
  • How to handle this elegantly in Python 3? Any suggestions? – zeekvfu Aug 14 '14 at 10:44
  • This actually works great for serving files where the name may contain non-American characters such as kanji symbols! – Mike McMahon Sep 17 '14 at 18:45
  • in python 3 you `import urllib.parse` instead of `urlparse`, decode b in urlEncodeNonAscii: `b.decode('utf-8')` and leave the idna part out of the iriToUri: `return urllib.parse.urlunparse([url_encode_non_ascii(part.encode('utf-8')) for part in parts])` – RvdBerg Dec 28 '15 at 12:36
  • Using UTF-8 for query is not always correct; details are in my answer. Web is a weird place. – Mikhail Korobov Nov 17 '16 at 12:01
  • AttributeError: module 'urllib' has no attribute 'unparse' – Mona Jalal Apr 02 '18 at 03:48
  • This doesn't encode spaces. To add that, include something like `re.sub(r'\s', '+', str)` – Tad Nov 07 '19 at 12:23
47

In Python 3, use the urllib.parse.quote function on the non-ASCII string:

>>> from urllib.request import urlopen
>>> from urllib.parse import quote
>>> chinese_wikipedia = 'http://zh.wikipedia.org/wiki/Wikipedia:' + quote('首页')
>>> urlopen(chinese_wikipedia)
Perry
25

Python 3 has libraries to handle this situation. Use urllib.parse.urlsplit to split the URL into its components, urllib.parse.quote to properly quote/escape the Unicode characters, and urllib.parse.urlunsplit to join it back together.

>>> import urllib.parse
>>> url = 'http://example.com/unicodè'
>>> url = urllib.parse.urlsplit(url)
>>> url = list(url)
>>> url[2] = urllib.parse.quote(url[2])
>>> url = urllib.parse.urlunsplit(url)
>>> print(url)
http://example.com/unicod%C3%A8
darkfeline
7

It is more complex than the accepted answer from @bobince suggests:

  • netloc should be encoded using IDNA;
  • non-ASCII URL path should be encoded to UTF-8 and then percent-escaped;
  • non-ASCII query parameters should be encoded to the encoding of the page the URL was extracted from (or to the encoding the server uses), then percent-escaped.

This is how all browsers work; it is specified in https://url.spec.whatwg.org/ - see this example. A Python implementation can be found in w3lib (this is the library Scrapy uses); see w3lib.url.safe_url_string:

from w3lib.url import safe_url_string
url = safe_url_string(u'http://example.org/Ñöñ-ÅŞÇİİ/', encoding="<page encoding>")

An easy way to check whether a URL-escaping implementation is incorrect or incomplete is to check whether it accepts a 'page encoding' argument.
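For illustration only, here is a hand-rolled sketch of the three rules above. This is not w3lib's implementation; page_encoding stands for whatever encoding the page (or server) uses, and user:pass@ / :port handling is omitted for brevity:

from urllib.parse import urlsplit, urlunsplit, quote

def sketch_safe_url(iri, page_encoding='utf-8'):
    parts = urlsplit(iri)
    # Rule 1: IDNA-encode the hostname (ignores any user:pass@ or :port).
    netloc = parts.hostname.encode('idna').decode('ascii') if parts.hostname else parts.netloc
    # Rule 2: the path is always UTF-8, then percent-escaped.
    path = quote(parts.path.encode('utf-8'))
    # Rule 3: the query uses the page's encoding, then percent-escaping.
    query = quote(parts.query.encode(page_encoding), safe='=&')
    fragment = quote(parts.fragment.encode('utf-8'))
    return urlunsplit((parts.scheme, netloc, path, query, fragment))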

Mikhail Korobov
7

Based on @darkfeline's answer:

from urllib.parse import urlsplit, urlunsplit, quote

def iri2uri(iri):
    """
    Convert an IRI to a URI (Python 3).
    """
    uri = ''
    if isinstance(iri, str):
        (scheme, netloc, path, query, fragment) = urlsplit(iri)
        scheme = quote(scheme)
        netloc = netloc.encode('idna').decode('utf-8')
        path = quote(path)
        query = quote(query)
        fragment = quote(fragment)
        uri = urlunsplit((scheme, netloc, path, query, fragment))

    return uri
Ukr
  • This has some issues: 1) the scheme does not support percent-encoded chars; 2) path and query have different sets of safe chars that should not be percent-encoded, and the default safe chars for quote are those of the path component - werkzeug has a better iri2uri implementation [[ref](https://github.com/pallets/werkzeug/blob/92c6380248c7272ee668e1f8bbd80447027ccce2/src/werkzeug/urls.py#L926-L931)]. – Iwan Aucamp Mar 21 '23 at 00:11
5

For those not depending strictly on urllib, one practical alternative is requests, which handles IRIs "out of the box".

For example, with http://bücher.ch:

>>> import requests
>>> r = requests.get(u'http://b\u00DCcher.ch')
>>> r.status_code
200
h7r
  • Amazing! Thanks! In my case I was trying to download a file (.png) in bytes format. My original code: `urllib.request.urlopen(url).read()` The code after the change to requests: `requests.get(url).content` – AngryCoder Jun 03 '22 at 10:01
4

Encode the Unicode string to UTF-8, then URL-encode it.
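For example, here is a minimal Python 2 sketch of that recipe, using urllib.quote as pointed out in the comments below. It assumes the hostname itself is ASCII and only the path contains non-ASCII characters:

import urllib

iri = u'http://example.org/\u00d1\u00f6\u00f1-\u00c5\u015e\u00c7\u0130\u0130/'  # the IRI from the question
url = urllib.quote(iri.encode('utf-8'), safe=':/')  # UTF-8-encode, then percent-encode the non-ASCII bytes
# url is now 'http://example.org/%C3%91%C3%B6%C3%B1-%C3%85%C5%9E%C3%87%C4%B0%C4%B0/'
# and can be passed to urllib2.urlopen()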

Ignacio Vazquez-Abrams
  • thanks for the response. can you be more specific please? `unicode(url, 'utf-8')` raises `TypeError: decoding Unicode is not supported`. also which function do you suggest for encoding url? urlencode for example is for building query string. but mine is only a path on the server. – onurmatik Dec 08 '10 at 16:16
  • http://farmdev.com/talks/unicode/ http://docs.python.org/library/urllib.html#urllib.quote – Ignacio Vazquez-Abrams Dec 08 '10 at 16:23
  • For the first part, you want `url.encode('utf-8')` (assuming `url` is a `unicode` object). – Karl Knechtel Dec 08 '10 at 16:38
  • @ignacio: thanks. i still think the problem is with the urlopen not accepting non-ascii characters as a URL (which it is right in a way, as they are not standard). please see my update in question. – onurmatik Dec 08 '10 at 16:39
4

Use the iri2uri method of httplib2. It does the same thing as bobince's answer (is he/she the author of that?).
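For instance, a minimal sketch of the call (note the non-ASCII domain limitation reported in the comment below):

import httplib2

uri = httplib2.iri2uri(u'http://example.org/\u00d1\u00f6\u00f1-\u00c5\u015e\u00c7\u0130\u0130/')
# The non-ASCII path characters are percent-encoded; the result can be passed to urlopen.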

Thorin Schiffer
  • The proposed solution doesn't work for non-ASCII domain names (IRI). `urllib2.urlopen(httplib2.iri2uri("http://домены.рф"), timeout=15)` returns **urlopen error [Errno -2] Name or service not known** – maxkoryukov Aug 23 '18 at 21:56
1

Another option for converting an IRI to an ASCII URI is to use the furl package:

gruns/furl: URL parsing and manipulation made easy. - https://github.com/gruns/furl

Python's standard urllib and urlparse modules provide a number of URL related functions, but using these functions to perform common URL operations proves tedious. Furl makes parsing and manipulating URLs easy.

Examples

Non-ASCII domain

http://国立極地研究所.jp/english/ (Japanese National Institute of Polar Research website)

import furl

url = 'http://国立極地研究所.jp/english/'
furl.furl(url).tostr()
'http://xn--vcsoey76a2hh0vtuid5qa.jp/english/'

Non-ASCII path

https://ja.wikipedia.org/wiki/日本語 ("Japanese" article in Wikipedia)

import furl

url = 'https://ja.wikipedia.org/wiki/日本語'
furl.furl(url).tostr()
'https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E8%AA%9E'
shuuji3
0

Works! Finally.

I could not avoid these strange characters, but in the end I got through it.

import urllib.request
import os


url = "http://www.fourtourismblog.it/le-nuove-tendenze-del-marketing-tenere-docchio/"
# Fetch the page and read the raw bytes.
with urllib.request.urlopen(url) as response:
    html = response.read()
# Decode as UTF-8 and save to a local file.
with open("marketingturismo.html", "w", encoding='utf-8') as file:
    file.write(html.decode('utf-8'))
# Open the saved file with the default application (works on Windows).
os.system("marketingturismo.html")
PythonProgrammi