Python: How to check if a string is a valid IRI?

Question

Is there a standard function to check whether a string is a valid IRI? To check a URL, apparently I can use:

parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    '''apparently not an url'''

I tried the above with a URL containing Unicode characters:

import urlparse
url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:  
    print "not an url"
else:
    print "yes an url"

and what I get is "yes an url". Does this mean I'm good and that this tests for a valid IRI? Is there another way?

Why shouldn't you be good? Does your example violate any rule defined by the IRI standard? In other words: are you asking us if your test breaks any IRI rules? Did you perform this research yourself? — Dr. Jan-Philip Gehrcke, Sep 24 '12 at 12:38
@Jan-PhilipGehrcke I am asking someone who has more experience than me with IRI, if I am good with this. — Eduard Florinescu, Sep 24 '12 at 12:40

score 20 · Accepted Answer · answered Sep 24 '12 at 12:46

Using urlparse is not sufficient to test for a valid IRI.

Use the rfc3987 package instead:

from rfc3987 import parse

parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')
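
Note that parse raises ValueError when the string does not match the rule, so a yes/no check needs a try/except around it. A minimal sketch, assuming Python 3 and the third-party rfc3987 package; the helper name is_valid_iri is made up for illustration and is not part of the package:

```python
# rfc3987 is a third-party package (pip install rfc3987), not part of the stdlib.
try:
    from rfc3987 import parse
except ImportError:  # package not installed
    parse = None


def is_valid_iri(text):
    """Return True if text matches the RFC 3987 'IRI' rule (hypothetical helper)."""
    if parse is None:
        raise RuntimeError("rfc3987 is not installed")
    try:
        parse(text, rule='IRI')  # raises ValueError when text does not match
    except ValueError:
        return False
    return True


if parse is not None:
    print(is_valid_iri('http://fdasdf.fdsfîășîs.fss/ăîăî'))
    print(is_valid_iri('not an iri'))
```

In Python 3 every str is Unicode text; a Python 2 byte string containing the same characters may be treated differently.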

Martijn Pieters

`ImportError: No module named rfc3987`, so it is not in the standard library; install it with `pip install rfc3987` – Eduard Florinescu Sep 24 '12 at 12:52
You have to install the package he links to – David Robinson Sep 24 '12 at 12:53
Works (+1), accepted, and you are right that `Using urlparse is not sufficient to test for a valid IRI`, because with the code provided above the `url` string is not a valid IRI. – Eduard Florinescu Sep 24 '12 at 13:00
But escaped works: `parse('http://fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com/%C4%83%C3%AE%C4%83%C3%AE', rule='IRI')` I get: `{'fragment': None, 'path': '/%C4%83%C3%AE%C4%83%C3%AE', 'scheme': 'http', 'authority': 'fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com', 'query': None}` – Eduard Florinescu Sep 24 '12 at 13:12
Notably, SSH format doesn't comply with URI or IRI: http://unix.stackexchange.com/q/75668/61349 – ThorSummoner Oct 09 '15 at 23:09
I only wish that more people would find this answer when googling. – Devin Jun 14 '17 at 19:01

score 1 · Answer 2 · edited May 23 '17 at 11:48

The only character-set-sensitive code in the implementation of urlparse is the requirement that the scheme contain only ASCII letters, digits and [+-.] characters; otherwise it is completely agnostic, so it will work fine with non-ASCII characters.

As this is undocumented behaviour, it's your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed to break IRIs.
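
Such a regression test might look like the sketch below. It assumes Python 3, where urlparse lives in urllib.parse, and the test function names are only illustrative:

```python
from urllib.parse import urlsplit


def test_urlsplit_keeps_non_ascii():
    # Non-ASCII characters in the host and path pass through untouched.
    parts = urlsplit("http://fdasdf.fdsfîășîs.fss/ăîăî")
    assert parts.scheme == "http"
    assert parts.netloc == "fdasdf.fdsfîășîs.fss"
    assert parts.path == "/ăîăî"


def test_urlsplit_ignores_non_ascii_scheme():
    # The scheme is the one place where only ASCII is accepted:
    # a non-ASCII "scheme" is not recognised, so nothing is split off.
    parts = urlsplit("ăttp://example.com/")
    assert parts.scheme == ""
    assert parts.netloc == ""


test_urlsplit_keeps_non_ascii()
test_urlsplit_ignores_non_ascii_scheme()
```

If either assertion starts failing after a Python upgrade, the undocumented behaviour has changed.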

urllib provides quoting functions to convert IRIs to/from ASCII URIs, although the documentation still doesn't mention IRIs explicitly, and they are broken in some cases; see the question "Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?"

edited by Community

answered Sep 24 '12 at 12:41

ecatmur

`urllib.quote(url)` seems to escape the `:` colon in the `http://` to `http%3A//` – Eduard Florinescu Sep 24 '12 at 13:15
@EduardFlorinescu yes, by default it only works for quoting the path section of an IRI; for a full IRI you'd need to parse, quote, and reassemble the components. – ecatmur Sep 24 '12 at 13:28
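
The parse, quote, and reassemble approach from the comment above can be sketched as follows. This is only an illustration under several assumptions: Python 3 (urllib.parse rather than urllib/urlparse), IDNA encoding for the host instead of percent-encoding, no userinfo or port in the authority, and a hypothetical helper name iri_to_uri:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper, simplified)."""
    parts = urlsplit(iri)
    # Host: IDNA (punycode) encoding; assumes no userinfo or port is present.
    host = parts.netloc.encode("idna").decode("ascii")
    # Other components: percent-encode as UTF-8, keeping delimiters and
    # any existing %XX escapes intact.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, host, path, query, fragment))


# Quoting the whole string escapes the colon after the scheme:
print(quote("http://example.com/ăîăî"))
# Quoting per component leaves the scheme and delimiters alone:
print(iri_to_uri("http://fdasdf.fdsfîășîs.fss/ăîăî"))
```

Keeping "%" in the safe set avoids double-encoding input that is already escaped, at the cost of not escaping a stray literal percent sign.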
