19

I've been trying to figure out the best way to validate a URL (specifically in Python) but haven't really been able to find an answer. It seems there isn't one known way to validate a URL; it depends on which URLs you think you may need to validate. I also found it difficult to find an easy-to-read standard for URL structure. I did find RFCs 3986 and 3987, but they contain much more than just how a URL is structured.

Am I missing something, or is there no one standard way to validate a URL?

mp94
  • what are you asking? You want to know if a domain is in a correct format? Where is your code? – Trent Mar 06 '14 at 23:11
  • possible duplicate of [How do you validate a URL with a regular expression in Python?](http://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python) – Blair Mar 06 '14 at 23:17

5 Answers

27

This looks like it might be a duplicate of How do you validate a URL with a regular expression in Python?

You should be able to use the `urlparse` function from the standard library, described there.

>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')

Call `urlparse` on the string you want to check, then make sure that the `ParseResult` has non-empty `scheme` and `netloc` attributes.
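For example, a minimal check along those lines (a sketch; it simply treats any string that parses with both a non-empty `scheme` and `netloc` as a URL):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def is_url(candidate):
    # A string counts as a URL here only if urlparse finds both
    # a scheme (e.g. 'http') and a netloc (e.g. 'google.com').
    parsed = urlparse(candidate)
    return bool(parsed.scheme) and bool(parsed.netloc)

print(is_url('http://google.com'))   # True
print(is_url('actually not a url'))  # False
```

Note that this only checks syntax; as the comments point out, strings like `http://examplecom` will still pass.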

faruk13
bgschiller
  • You might want to use `rfc3987` (https://pypi.python.org/pypi/rfc3987) or do more processing on the urlparse result. urlparse won't actually validate a netloc as an "internet url" -- I got bitten by this too. `urlparse('http://invalidurl')` will give you a netloc + scheme. – Jonathan Vanasco Jul 11 '14 at 22:35
  • @JonathanVanasco, `python -c "import urlparse; print urlparse.urlparse('invalidurl')"` gives `ParseResult(scheme='', netloc='', path='invalidurl', params='', query='', fragment='')`, so no `netloc` or `scheme`. But that does look like a better package for this problem, as it also provides validation. – bgschiller Jul 12 '14 at 05:37
  • Sorry, the formatting screwed up the display and autolinked in my original comment. I had intended `urlparse.urlparse('http://invalidurl')` - notice the scheme was stripped from the original. The `urlparse` module interprets 'invalidurl' as a hostname for the netloc -- that's a correct interpretation of the general format, but most people don't intend for input like that to pass. I've encountered too many typos like `http://example.com` -> `http://examplecom`. If you pass in IP addresses, it doesn't enforce IPv4 or IPv6 either, so it will accept `999.999.999.999.999` too. – Jonathan Vanasco Jul 12 '14 at 19:16
  • It does look like that's a more strict parser, but `rfc3987` lets through both of those cases as well (`999.999.999.999.999.999` and `http://examplecom`). – bgschiller Jul 13 '14 at 15:20
  • In Python 3: `import urllib.parse as urlparse` – gies0r Mar 16 '17 at 09:34
  • @gies0r this should probably be `from urllib.parse import urlparse`, as the code above imports the whole parse module – basse Aug 13 '17 at 20:53
  • So "x://a.bc.1" is a valid URL (scheme='x', netloc='a.bc.1') and apple.de is not (scheme='', netloc='')!? Not really practical… – oxidworks Feb 15 '18 at 18:02

22

The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing regex-based validation of URLs for compliance with the RFC standard. Some details:

  • Tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
  • No dependencies on Python 3.x, one conditional dependency in Python 2.x (drop-in replacement for Python 2.x's buggy re module)
  • Unit tests that cover 100+ different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.

It's also very easy to use:

from validator_collection import validators, checkers

checkers.is_url('http://www.stackoverflow.com')
# Returns True

checkers.is_url('not a valid url')
# Returns False

value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'

value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('https://123.12.34.56:1234')
# value set to 'https://123.12.34.56:1234'

value = validators.url('http://10.0.0.1')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('http://10.0.0.1', allow_special_ips = True)
# value set to 'http://10.0.0.1'

In addition, Validator-Collection includes 60+ other validators, covering IP addresses (IPv4 and IPv6), domains, and email addresses as well, which folks might find useful.

Chris Modzelewski
  • This looks like a really nice package. I haven't tried it yet, but it deserves more than 0 upvotes :-). – Dave Dec 03 '18 at 18:29
  • This only works with domain names - it doesn't appear to like ip addresses though. proxy.remote.http: 'http://XX.XXX.X.XXX:XXXX/' is not a url. proxy.remote.https: 'http://XX.XXX.X.XXX:XXXX/' is not a url. – FiferJanis Jan 24 '20 at 18:57
  • Not sure I understand what exactly you mean. The value `XX.XXX.X.XXX:XXXX` will never validate correctly because a) it does not have a valid protocol, and because b) the port (`:XXXX`) is not expressed as a valid port address. If you try to validate `http://XX.XXX.X.XXX:1234` that will validate correctly. If you try to validate an IP `http://123.165.43.12:1234` that will validate as well. What's the exact issue that you're encountering? – Chris Modzelewski Jan 25 '20 at 00:08
  • Also - a follow-up: there are certain special IP addresses (like loopback IPs like `127.0.0.1` or `0.0.0.0`) which are considered special cases by the RFCs for URLs and IP addresses. By default, they will fail validation. However, you can have them be allowed (pass validation) by passing the `allow_special_ips = True` parameter to the validator function. More details in the documentation. – Chris Modzelewski Jan 25 '20 at 00:11
1

I would use the validators package. Here is the link to the documentation and installation instructions.

It is just as simple as

import validators
url = 'YOUR URL'
validators.url(url)

It will return `True` if the URL is valid, and a falsy `ValidationFailure` object if not.

  • The following fails: `print(validators.url("apple.com"))` – Larytet Dec 01 '19 at 09:50
  • @Larytet Because that's not a valid url. – Joshua Wolff Jun 15 '20 at 20:09
  • However, I found a case in which validators fails: https:// seekingalpha dot com/article/4353927/track?type=cli....traºnner_utm_.... (eliminating the extra stuff with "..."). The "º" is not detected and validators returns True. In fact, this URL is not valid. – Joshua Wolff Jun 15 '20 at 20:09

1

You can also try using `urllib.request` to validate a URL: pass it to the `urlopen` function and catch the `URLError` exception.

from urllib.request import urlopen
from urllib.error import URLError  # URLError lives in urllib.error, not urllib.request

def validate_web_url(url="http://google"):
    try:
        urlopen(url)  # note: this makes a live network request
        return True
    except (URLError, ValueError):  # ValueError covers malformed URLs like 'not a url'
        return False

This would return False in this case

Hamza
-1

Assuming you are using Python 3, you could use `urllib`. The code would go something like this:

import urllib.request as req
import urllib.error

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        # response is an HTTPResponse object; response.read() returns
        # the page's HTML, which you can search through
        return True
    except urllib.error.URLError:
        # The url wasn't valid (or the host couldn't be reached)
        return False

If no exception is raised by the `urlopen` call, the URL is valid (and reachable).

mdw7326
  • This only works if the host has an internet connection, which may not always be true. – bgschiller Mar 06 '14 at 23:52
  • It would be preferable to not have to use an internet connection to determine if the URL is valid. Also using Python 2.7, should have specified that in the original question. – mp94 Mar 07 '14 at 03:55