
Is there a standard function to validate an IRI? To check a URL, apparently I can use:

parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    pass  # apparently not a URL

I tried the above with a URL containing Unicode characters:

# -*- coding: utf-8 -*-
# Python 2 (in Python 3 the module is urllib.parse)
import urlparse

url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    print "not an url"
else:
    print "yes an url"

and what I get is `yes an url`. Does this mean I'm good, and that this tests for a valid IRI? Is there another way?

BenMorel
Eduard Florinescu
  • Why shouldn't you be good? Does your example violate any rule defined by the IRI standard? In other words: are you asking us if your test breaks any IRI rules? Did you perform this research yourself? – Dr. Jan-Philip Gehrcke Sep 24 '12 at 12:38
  • @Jan-PhilipGehrcke I am asking someone who has more experience than me with IRI, if I am good with this. – Eduard Florinescu Sep 24 '12 at 12:40

2 Answers


Using urlparse is not sufficient to test for a valid IRI.
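
To illustrate why, here is a minimal sketch using Python 3's `urllib.parse.urlsplit` (the Python 3 counterpart of `urlparse.urlsplit`). The helper name `looks_like_url` and the sample strings are made up for illustration. The scheme/netloc check accepts a string containing a space and angle brackets, which are not allowed in an IRI:

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def looks_like_url(url):
    """Hypothetical helper: the scheme/netloc check from the question."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


# A space and "<", ">" are not permitted in an IRI, yet the check passes,
# because urlsplit only splits the string at its delimiters.
print(looks_like_url("http://bad host/<not allowed>"))  # True

# A string with no scheme at all is rejected, as expected.
print(looks_like_url("no-scheme-here"))  # False
```

So the check tells you that the string has the overall shape of a URL, not that every character in it is legal.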

Use the rfc3987 package instead:

from rfc3987 import parse

parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')
Martijn Pieters
  • `ImportError: No module named rfc3987` so it is not standard, `pip install rfc3987` – Eduard Florinescu Sep 24 '12 at 12:52
  • You have to install the package he links to – David Robinson Sep 24 '12 at 12:53
  • Works (+1), accepted, and you are right that `Using urlparse is not sufficient to test for a valid IRI`, because the `url` string in the code above is not a valid IRI. – Eduard Florinescu Sep 24 '12 at 13:00
  • But escaped works: `parse('http://fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com/%C4%83%C3%AE%C4%83%C3%AE', rule='IRI')` I get: `{'fragment': None, 'path': '/%C4%83%C3%AE%C4%83%C3%AE', 'scheme': 'http', 'authority': 'fdasdf.fdsf%C3%AE%C4%83%C8%99%C3%AEs.com', 'query': None}` – Eduard Florinescu Sep 24 '12 at 13:12
  • Notably, SSH format doesn't comply with URI or IRI: http://unix.stackexchange.com/q/75668/61349 – ThorSummoner Oct 09 '15 at 23:09
  • I only wish that more people would find this answer when googling. – Devin Jun 14 '17 at 19:01

The only character-set-sensitive code in the implementation of urlparse is the requirement that the scheme contain only ASCII letters, digits and [+-.] characters; otherwise it is completely agnostic, so it will work fine with non-ASCII characters.

As this is undocumented behaviour, it's your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed to break IRIs.
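
Such a project-level test could look roughly like the sketch below. It assumes Python 3, where the function lives in `urllib.parse`; the function name `check_urlsplit_is_iri_agnostic` and the `hétp` sample are invented for illustration.

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit


def check_urlsplit_is_iri_agnostic():
    """Hypothetical regression test for the undocumented behaviour."""
    # Non-ASCII characters in the host and path pass through untouched.
    parts = urlsplit("http://fdasdf.fdsfîășîs.fss/ăîăî")
    assert parts.scheme == "http"
    assert parts.netloc == "fdasdf.fdsfîășîs.fss"
    assert parts.path == "/ăîăî"

    # The scheme is the one place where the character set matters:
    # a non-ASCII letter there means no scheme is recognised at all.
    bad = urlsplit("hétp://example.com/")
    assert bad.scheme == ""
    assert bad.netloc == ""


check_urlsplit_is_iri_agnostic()
```

If a future release changed how non-ASCII input is split, these assertions would fail and flag it.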

urllib provides quoting functions to convert IRIs to/from ASCII URIs, although the documentation still doesn't mention IRIs explicitly, and they are broken in some cases; see the question "Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?"

ecatmur
  • `urllib.quote(url)` seems to escape the `:` colon in the `http://` to `http%3A//` – Eduard Florinescu Sep 24 '12 at 13:15
  • @EduardFlorinescu yes, by default it only works for quoting the path section of an IRI; for a full IRI you'd need to parse, quote, and reassemble the components. – ecatmur Sep 24 '12 at 13:28
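
The "parse, quote, and reassemble" approach from the last comment might be sketched as follows, assuming Python 3 (`urllib.parse`). The function name `iri_to_uri` and the choice of safe characters are assumptions made for this example, not part of any library. It percent-encodes the host as UTF-8, as in the escaped example in the comments on the other answer; converting the host with IDNA instead is another option this sketch does not cover.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in each component (an assumption for this
# sketch). "%" is kept so existing escapes are not encoded twice.
_SUB_DELIMS = "!$&'()*+,;="


def iri_to_uri(iri):
    """Hypothetical helper: quote each component separately, then rejoin."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,  # left untouched, so ":" is never escaped
        quote(parts.netloc, safe=":@[]%" + _SUB_DELIMS),
        quote(parts.path, safe="/:@%" + _SUB_DELIMS),
        quote(parts.query, safe="/?:@%" + _SUB_DELIMS),
        quote(parts.fragment, safe="/?:@%" + _SUB_DELIMS),
    ))


print(iri_to_uri("http://fdasdf.fdsfîășîs.fss/ăîăî"))
```

Because the scheme and the delimiters are never passed through `quote`, the `http://` prefix survives intact, unlike calling `quote` on the whole string.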