213

I have a URL from the user, and I have to reply with the fetched HTML.

How can I check whether the URL is malformed or not?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed
Wai Ha Lee
Yugal Jindle
  • 2
    Just try to read it, if for instance httplib throws an exception, then you'll know it was invalid. _Not all well formed urls are valid_! – carlpett Aug 23 '11 at 12:07
  • 23
    `url='http://google' ` is not malformed. Schema + hostname is always valid. – Viktor Joras Nov 04 '18 at 06:53

17 Answers

222

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).
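
For the original use case (validate the user-supplied URL, then fetch the HTML), a minimal sketch could look like this. Note that the use of requests below is my own assumption, not part of the validators package:

import requests   # assumed HTTP client; any other would do
import validators

def fetch_html(url):
    """Return the page HTML, or raise ValueError for a malformed URL."""
    if not validators.url(url):
        raise ValueError("Malformed URL: {!r}".format(url))
    return requests.get(url).text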

Asclepius
Jabba
152

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

val = URLValidator(verify_exists=False)
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

If you set verify_exists to True, it will actually verify that the URL exists, otherwise it will just check if it's formed correctly.

edit: ah yeah, this question is a duplicate of this: How can I check if a URL exists with Django’s validators?
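
For what it's worth, a boolean wrapper for current Django versions (where verify_exists is long gone, as the comments below note) could look roughly like this; outside a Django project you may first need django.conf.settings.configure(), as one commenter mentions:

from django.core.exceptions import ValidationError
from django.core.validators import URLValidator

def is_valid_url(url):
    # URLValidator() raises ValidationError for malformed input
    # and returns None otherwise.
    try:
        URLValidator()(url)
        return True
    except ValidationError:
        return False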

Drekembe
  • 66
    But this will only work in the django environment not otherwise. – Yugal Jindle Aug 23 '11 at 12:22
  • Oh sorry I don't know why I thought this question had the django tag. Yikes, sorry. – Drekembe Aug 23 '11 at 12:38
  • 28
    `verify_exists` is deprecated. -1 –  Jul 02 '13 at 16:17
  • 2
    Add: from django.conf import settings settings.configure(DEBUG=False) and remove the verify_exists to keep it working with django 1.5 – Dukeatcoding Aug 05 '13 at 13:22
  • 1
    @YugalJindle Correct, but stripping it from Django is almost trivial :D. So, I use this method – swdev Aug 29 '14 at 23:17
  • 9
    Note, with django >= 1.5 there is no `verify_exists` anymore. Also instead of the `val` variable you can call it like `URLValidator()('http://www.google.com')` – luckydonald Sep 21 '16 at 17:04
  • Would it make sense to import Django just to do a URL validation? – pbreitenbach Sep 10 '19 at 03:29
  • What does the val function return though? That needs to be specified here. I tested it out, and the val function returns None if the argument is a url and throws a validation error otherwise. – kloddant Jul 29 '21 at 16:51
134

A True or False version, based on @DMfll's answer:

try:
    # Python 2
    from urlparse import urlparse
except ImportError:
    # Python 3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except (ValueError, TypeError, AttributeError):  # e.g. a non-string argument
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

Gives:

True
False
False
False
True
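
If you need something stricter, a hedged variant along the lines suggested in the comments below, with a scheme whitelist of my own choosing, might look like this:

from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "ftp"}  # whitelist chosen for this sketch

def strict_uri_validator(x):
    # Assumes x is a string; urlparse accepts almost anything,
    # so additionally require a whitelisted scheme and a non-empty netloc.
    try:
        result = urlparse(x)
    except ValueError:
        return False
    return result.scheme in ALLOWED_SCHEMES and bool(result.netloc)
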
xavdid
alemol
  • 11
    I didn't know you could test an if statement with a list of non-None elements. That's helpful. Also +1 for using a built-in module – Marc Maxmeister Aug 04 '16 at 17:05
  • Empty lists maps to False in condition contexts. – alemol Aug 05 '16 at 15:22
  • 19
    This allows everything. It returns `True` for the string `fake` or even for a blank string. There will never be any errors because those attributes are always there, and the list will always have a boolean value of True because it contains those attributes. Even if all of the attributes are None, the list will still be non-empty. You need some validation of the attributes because everything passes the way you have it now. – zondo Oct 13 '16 at 12:58
  • 3
    Lists of false objects evaluate to True: `print("I am true") if [False, None, 0, '', [], {}] else print("I am false.")` prints "I am true." when I run it. `[result.scheme, result.netloc, result.path]` always evaluates to `True`. `print("I am True") if [] else print("I am False.")` prints "I am false." so empty lists are False. The contents of the array needs evaluation with something like the `all` function. – dmmfll Nov 11 '16 at 14:50
  • I've edited it to `return result.scheme and result.netloc and result.path` instead of doing a comparison with a non-empty list (which is always `True` as noted by others above.) – Peter Wood Oct 19 '17 at 08:33
  • 3
    Not sure why you would require a path like that. You should remove `result.path` from the test. – Jerinaw Jun 18 '19 at 14:01
  • 3
    This is good enough for me, thanks. I just added a simple validation for `scheme`: `if not all([result.scheme in ["file", "http", "https"], result.netloc, result.path]):` – Alexander Fortin Feb 13 '20 at 07:02
  • 1
    @AlexanderFortin, right, the `scheme` is allowed to be a whatever string here, so this check is needed. For example blocking `javascript:` could prevent XSS attacks. The best method here is to whitelist few allowed schemes, as you have done in your comment. – adamczi May 08 '20 at 09:11
  • 2
    Rotten algorithm. This validates malformed URIs, like `https://https://https://www.foo.bar` as `ParseResult(scheme='https', netloc='https:', path='//https://www.foo.bar', params='', query='', fragment='')`. **IN OTHER WORDS IT PASSES MANGLED URIs.** – ingyhere Jun 28 '20 at 22:40
  • @alemol Can you specify what type of error is in except? https://docs.python.org/3/tutorial/errors.html – Lukkar Aug 14 '20 at 16:51
  • 1
    This returns false for a file URL: `file:///home/user` – xuhdev Dec 01 '20 at 05:20
  • For those who likes on-liners: `is_valid = all(list(urllib.parse.urlparse(url))[:2])` – milembar Dec 09 '20 at 10:07
  • 1
    +1 for maybe being a bit more careful with such examples ... you should really think twice before encouraging others to use 'catch-all' exception handlers with an SO reply. Not sure what a good reference for this is, but surely you're risking hiding errors that are misinterpreted. Perhaps investigate at least what classes to watch out for? https://en.wikipedia.org/wiki/Error_hiding (See jonaprieto's answer with ValueError for a possible resolution - I myself haven't verified that.) – brezniczky Feb 20 '21 at 15:54
  • This returns false for urls without a path - e.g.: https://stackoverflow.com – tofarr Apr 26 '21 at 20:15
  • 2
    I've updated the example to not require a path, as that's not required – xavdid Jul 11 '21 at 00:59
133

django url validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False
cetver
  • 2
    a curiosity... did you add the `ftp`? Or have I an old django version? – Ruggero Turra Aug 23 '11 at 12:23
  • >>wiso: django version 1.3 ( make sure yourself: /django/core/validators.py, line:47 ) ftp://someftp.com - invalid url ? Even stackoferlow parser someftp.com makes as link ) – cetver Aug 23 '11 at 13:04
  • >>Yugal Jindle: django developers thinks protocol is required, but you can modify this regexp if you do not think so – cetver Aug 23 '11 at 13:06
  • Sorry, I meant url='http://www.google' is malformed right ? But it matches the regex.. so can something be done for that ? – Yugal Jindle Aug 25 '11 at 07:54
  • 2
    @yugal-jindle http://www.sitedomain is not a valid url. http://www.museum/ is because .museum is a top-level-domain (ICANN [1] defines them), and not a sitedomain. [1] http://www.icann.org/ – glarrain Oct 10 '12 at 16:50
  • 1
    This one doesn't seem to work with http://username:password@example.com style URLs – Adam Baxter Aug 15 '15 at 19:31
  • does anybody have a link to the source in its context? – cowlinator Aug 29 '17 at 23:33
  • 1
    @cowlinator https://github.com/django/django/blob/stable/1.3.x/django/core/validators.py#L45 – cetver Aug 30 '17 at 16:13
  • I found a bug: `httpbin.org` – Rob Truxal Jan 10 '18 at 06:16
  • 5
    This will not work for IPv6 urls, which have the form `http://[2001:0DB8::3]:8080/index.php?valid=true#result` – cimnine Feb 04 '18 at 20:42
  • url='http://google' is actually a valid url in most browsers. It is common to use on internal networks to allow the use of http://myserver/mypage instead of http://myserver.myoffice.mydomain.com for obvious reasons. Most people will want to allow the local short server name. – user6830669 Mar 31 '20 at 14:34
  • Can you explain how this regular expression works? `r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...` – Aska Oct 11 '20 at 08:38
34

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").

Hope it helps!

Jonathan Prieto-Cubides
  • It fails in case the domain name begins with a dash, which is not valid. https://tools.ietf.org/html/rfc952 – Björn Lindqvist Mar 25 '19 at 18:37
  • 5
    This is only good to split up components in the special case that the URI is known to **NOT** be malformed. As I replied earlier to the other similar answer, this validates malformed URIs, like `https://https://https://www.foo.bar`. – ingyhere Jun 28 '20 at 22:46
  • As of Python 3.7.6, I tested this logic with "https://-wee.com" and it worked – Jesuisme Jul 07 '22 at 21:55
19

I landed on this page trying to figure out a sane way to validate strings as "valid" URLs. I'm sharing my solution here, using Python 3; no extra libraries are required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3 as I am.

import urllib.parse
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)

min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')

ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')

'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.

'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
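
For example (my own quick check, reusing the two test strings from above):

print(is_valid('https://stackoverflow.com'))      # True
print(is_valid('dkakasdkjdjakdjadjfalskdjfalk'))  # False
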
dmmfll
10

note - lepl is no longer supported, sorry (you're welcome to use it, and i think the code below works, but it's not going to get updates).

rfc 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for http urls and email). i implemented its recommendations in python using lepl (a parser library). see http://acooke.org/lepl/rfc3696.html

to use:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True
andrew cooke
6

EDIT

As pointed out by @Kwame, the code below validates the URL even if the .com or .co etc. is not present.

As also pointed out by @Blaise, a URL like https://www.google is a valid URL, and you need to do a separate DNS check to see whether it resolves or not.

This is simple and works:

min_attr contains the basic set of components that need to be present to establish the validity of a URL, i.e. the http:// part and the google.com part.

result.scheme stores the scheme (e.g. http) and

result.netloc stores the domain name (e.g. google.com).

from urlparse import urlparse
def url_check(url):

    min_attr = ('scheme' , 'netloc')
    try:
        result = urlparse(url)
        if all([result.scheme, result.netloc]):
            return True
        else:
            return False
    except:
        return False

all() returns True only if every element inside it is truthy. So if both result.scheme and result.netloc are present, i.e. have some value, the URL is considered valid and the function returns True.
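
A quick illustration with the URLs from the question (my own snippet; note the last case, which is what the edit above is about):

print(url_check('http://google.com'))  # True
print(url_check('google.com'))         # False (no scheme)
print(url_check('http://google'))      # True - no TLD or DNS check is done here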

faruk13
Padam Sethia
5

Here's a regex solution, since the top-voted regex doesn't work for odd cases involving the top-level domain. Some test cases are below.

import re

regex = re.compile(
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"(([\w-]+)\.(\w+))"        # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc.   (optional)
)
cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com"
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co."
    "https://google.co."
]
for c in cases:
    match = regex.match(c)
    # guard against None so non-matching strings print False instead of raising
    print(c, match is not None and match.span()[1] - match.span()[0] == len(c))

Edit: Added hyphen to domain as suggested by nickh.

Táwros
  • 1
    error in last line fixed: `print(c, x.span()[1] - x.span()[0] == len(c) if (x := regex.match(c)) else False)` – pmiguelpinto90 Nov 26 '21 at 13:38
  • Thanks Miguel, but I would like to warn others who do not use Python 3.8+ since ":=" is not valid for former versions. – Başar Söker Dec 29 '21 at 13:34
  • It doesn't match domains with hyphens, e.g https://api-example.com Consider using (\w+://)?(\w+\.)?(([\w-]+)\.(\w+))(\.\w+)*([\w\-\._\~/]*)*(?<!\.) – nickh Apr 26 '23 at 11:03
  • it also doesn't match a single word, for example `"fred"`. it gives the error `AttributeError: 'NoneType' object has no attribute 'span'` – colin0117 Aug 04 '23 at 11:00
  • @colin0117, this shouldn't match a single word. I recommend checking for that edge case in your code. – Táwros Aug 25 '23 at 14:22
  • @Táwros that's a problem though, it's not an edge case - it's an invalid URL - so the regex solution should reject it. – colin0117 Aug 29 '23 at 11:43
3

Validate URL with urllib and Django-like regex

The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s)
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))

    return url

Explanation

  • The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into the two corresponding parts, which are then matched against the regex terms above.)
  • The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

    https://www.google.com:80/search?q=python
    ^^^^^   ^^^^^^^^^^^^^^^^^
      |             |      
      |             +-- netloc (aka "domain" in my code)
      +-- scheme
    
  • IPv4 addresses are also validated

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

  • Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
  • Add and not is_valid_ipv6(domain) to the last if

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:
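
For instance, a short sketch exercising validate_url end to end (these test inputs are my own):

for candidate in ("https://www.google.com:80/search?q=python",
                  "http://localhost:8080",
                  "http://google",
                  "google.com"):
    try:
        print("valid:  ", validate_url(candidate))
    except Exception as error:
        print("invalid:", candidate, "->", error)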

winklerrr
3

Pydantic can be used to do this. I'm not very familiar with it, so I can't speak to its limitations, but it is an option, and no one else has suggested it.

I have seen many people ask about ftp and file URLs in previous answers, so I recommend getting to know the documentation, as Pydantic has many types for validation, such as FileUrl, AnyUrl and even database URL types.

A simple usage example:

from requests import get, HTTPError, ConnectionError
from pydantic import BaseModel, AnyHttpUrl, ValidationError
    
class MyConfModel(BaseModel):
    URI: AnyHttpUrl

try:
    myAddress = MyConfModel(URI = "http://myurl.com/")
    req = get(myAddress.URI, verify=False)
    print(myAddress.URI)

except ValidationError:
    print('Invalid destination')

Pydantic also raises exceptions (pydantic.ValidationError) that can be used to handle errors.

I have tested it with these patterns:

dxtr_brz
1

All of the above solutions recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. This solution always works as it should.

import re

# URL-link validation
ip_middle_octet = u"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = u"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
                        u"^"
                        # protocol identifier
                        u"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
                        # user:pass authentication
                        u"(?:\S+(?::\S*)?@)?"
                        u"(?:"
                        u"(?P<private_ip>"
                        # IP address exclusion
                        # private & local networks
                        u"(?:localhost)|"
                        u"(?:(?:10|127)" + ip_middle_octet + u"{2}" + ip_last_octet + u")|"
                        u"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + u")|"
                        u"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + u"))"
                        u"|"
                        # IP address dotted notation octets
                        # excludes loopback network 0.0.0.0
                        # excludes reserved space >= 224.0.0.0
                        # excludes network & broadcast addresses
                        # (first & last IP address of each class)
                        u"(?P<public_ip>"
                        u"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
                        u"" + ip_middle_octet + u"{2}"
                        u"" + ip_last_octet + u")"
                        u"|"
                        # host name
                        u"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
                        # domain name
                        u"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
                        # TLD identifier
                        u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
                        u")"
                        # port number
                        u"(?::\d{2,5})?"
                        # resource path
                        u"(?:/\S*)?"
                        # query string
                        u"(?:\?\S*)?"
                        u"$",
                        re.UNICODE | re.IGNORECASE
                       )
def url_validate(url):
    """URL string validation"""
    return URL_PATTERN.match(url)
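
A quick sanity check against the URLs from the question (my own addition):

for candidate in ("http://google.com", "http://google", "google.com"):
    print(candidate, bool(url_validate(candidate)))
# http://google.com True
# http://google False
# google.com False
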
  • 2
    http://www.google.com/path,www.yahoo.com/path *is* valid. See [RFC 3986](https://tools.ietf.org/html/rfc3986): a `path` is made of `segment`s which are built from `pchar`s which may be `sub-delims` one of which is `","`. – Anders Kaseorg Aug 17 '20 at 18:25
  • Yes, the symbol "," is included in the list of acceptable sub-delims, but the line from my example, even in a terrible dream, cannot be a valid url =) – Сергей Дорофий Aug 19 '20 at 04:25
  • @СергейДорофий why not? If it is valid according to the grammar for an URI it is valid URI by definition, not sure I follow why you say it can't be valid if it contains valid characters. – Iwan Aucamp May 21 '22 at 10:27
1

Not directly relevant, but it is often necessary to identify whether some token CAN be a URL, not necessarily one that is 100% correctly formed (i.e. the https part omitted and so on). I've read this post and did not find a solution, so I am posting my own here for the sake of completeness.

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)
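
For example (my own quick check of the heuristic):

print(reminds_url('yandex.ru.com/somepath'))    # True  (ends with a known suffix)
print(reminds_url('www.internal-server/page'))  # True  (starts with "www")
print(reminds_url('just a plain sentence'))     # False
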
Anatoly Alekseev
  • 1
    I needed a stricter validator than what most answers implemented - correctly formed AND with a valid TDL. You answer gave me the neccessary second part, which I combined with a regex. Thank you. – lennihein May 01 '22 at 17:19
0

Use this example to construct your own meaning of a "URL", and apply it everywhere in your code:

#         DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
#                 Version 2, December 2004
#
# Copyright (C) 2004 Sam Hocevar <sam@hocevar.net>
#
# Everyone is permitted to copy and distribute verbatim or modified
# copies of this license document, and changing it is allowed as long
# as the name is changed.
#
#         DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
# TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
#
# 0. You just DO WHAT THE FUCK YOU WANT TO.
#
# Copyright © 2023 Anthony anthony@example.com
#
# This work is free. You can redistribute it and/or modify it under the
# terms of the Do What The Fuck You Want To Public License, Version 2,
# as published by Sam Hocevar. See the LICENSE file for more details.

import operator as op

from urllib.parse import (
    ParseResult,
    urlparse,
)

import attrs
import pytest

from phantom import Phantom
from phantom.fn import compose2


def is_url_address(value: str) -> bool:
    return any(urlparse(value))


class URL(str, Phantom, predicate=is_url_address):
    pass


# presume that an empty URL is a nonsense
def test_empty_url():
    with pytest.raises(TypeError, match="Could not parse .* from ''"):
        URL.parse("")


# is it enough now?
def test_url():
    assert URL.parse("http://")


scheme_and_netloc = op.attrgetter("scheme", "netloc")


def has_scheme_and_netloc(value: ParseResult) -> bool:
    return all(scheme_and_netloc(value))


# need a bit of FP magic  here
class ReachableURL(URL, predicate=compose2(has_scheme_and_netloc, urlparse)):
    pass


def test_empty_reachable_url():
    with pytest.raises(TypeError, match="Could not parse .* from ''"):
        ReachableURL.parse("")


# but "empty" for an URL is not just "empty string"
def test_reachable_url_probably_wrong_host():
    assert ReachableURL.parse("http://example")


def test_reachable_url():
    assert ReachableURL.parse("http://example.com")


def test_reachable_url_without_scheme():
    with pytest.raises(TypeError, match="Could not parse .* from 'example.com'"):
        ReachableURL.parse("example.com")


# constructor works too
def test_constructor():
    assert ReachableURL("http://example.com")


# but it *is* `str`
def test_url_is_str():
    assert isinstance(ReachableURL("http://example.com"), str)


# now we can write plain old classes utilizing our `URL` and `ReachableURL`

# I'm lazy...


@attrs.define
class Person:
    homepage: ReachableURL


def test_person():
    person = Person(homepage=ReachableURL("https://example.com/index.html"))

    assert person.homepage


def greet(person: Person) -> None:
    print(f"Hello! I will definitely visit you at {person.homepage}.")


if __name__ == "__main__":
    greet(Person(homepage=ReachableURL.parse("tg://resolve?username")))

It will not be surprising if an URL RFC turns out to be Turing-complete!

Anthony
0

This code uses socket, which you don't need to install because it is a built-in library. It tries to connect to the input URL.

import socket

def isValid(url):
    #connect to the host -- tells us if the host is actually reachable
    try:
        socket.create_connection((url, 80))
        return True
    except socket.gaierror:
        return False
    except OSError:
        return False

A socket.gaierror occurs if the URL is not valid, and an OSError occurs when you are not connected.

It returns True for both "https://www.google.com" and "google.com".

If it is a problem, you can simply use this code:

import socket

def isValid(url):
    if url.startswith("https://www.") or url.startswith("http://www."):
        try:
            socket.create_connection((url, 80))
            return True
        except socket.gaierror:
            return False
        except OSError:
            return False
    else:
        return False
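
A hedged variant of the same idea: socket.create_connection expects a bare host name rather than a full URL, so stripping the scheme and path with urlparse first may be more robust (this sketch is my own, not part of the answer above):

import socket
from urllib.parse import urlparse

def is_reachable(url):
    # Use the parsed host name when a scheme is present, otherwise
    # treat the whole string as a host name (e.g. "google.com").
    host = urlparse(url).hostname or url
    try:
        socket.create_connection((host, 80), timeout=5).close()
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
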
-1
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

url = 'http://google.com'
if is_valid_url(url):
    print('Valid URL')
else:
    print('Malformed URL')
jmoerdyk
  • 1
    While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch Jan 23 '23 at 14:55
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – shamnad sherief Jan 24 '23 at 19:03
-2

Function based on Dominic Tarro's answer:

import re
def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://" # protocol
        r"(\w+(\-\w+)*\.)?" # host (optional)
        r"((\w+(\-\w+)*)\.(\w+))" # domain
        r"(\.\w+)*" # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)" # path, params, anchors, etc. (optional)
    , x))
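
For example (my own quick check against the URLs from the question):

print(is_url("http://google.com"))  # True
print(is_url("http://google"))      # False
print(is_url("google.com"))         # False
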
pmiguelpinto90