17

There is already a question with almost the same name: What is the best regular expression to check if a string is a valid URL

I don't understand how Stack Overflow works. It seems I need reputation to comment on an answer. Since I don't have any, I can't point out that the proposed solution doesn't seem to work. So am I forced to post a new question and ask for the solution this way?

UPDATE: It turns out that the regexp does support IPv6 and the mistake was mine: an IPv6 address in a URL has to be written in brackets, like http://[2620:0:1cfe:face:b00c::3]/.

So the only real problem I know of now is that it accepts example.org: as a valid URL.

Or is PHP to blame?

/**
  * Validate URL - RFC 3987 (IRI)
  *
  * https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
  *
  * @param string $str_url
  * @return boolean
  */
 function is_url($str_url)
 {
  // RFC 3987 For absolute IRIs (internationalized):
  return (bool) preg_match('/^[a-z](?:[-a-z0-9\+\.])*:(?:\/\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:])*@)?(?:\[(?:(?:(?:[0-9a-f]{1,4}:){6}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|::(?:[0-9a-f]{1,4}:){5}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){4}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:[0-9a-f]{1,4}:[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){3}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:){2}(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,3}[0-9a-f]{1,4})?::[0-9a-f]{1,4}:(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::(?:[0-9a-f]{1,4}:[0-9a-f]{1,4}|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3})|(?:(?:[0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(?:(?:[0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::)|v[0-9a-f]+[-a-z0-9\._~!\$&\'\(\)\*\+,;=:]+)\]|(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[
0-9][0-9]|2[0-4][0-9]|25[0-5])){3}|(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=@])*)(?::[0-9]*)?(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@]))*)*|\/(?:(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@]))*)*)?|(?:(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@]))+)(?:\/(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}
\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@]))*)*|(?!(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@])))(?:\?(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@])|[\x{E000}-\x{F8FF}\x{F0000}-\x{FFFFD}|\x{100000}-\x{10FFFD}\/\?])*)?(?:\#(?:(?:%[0-9a-f][0-9a-f]|[-a-z0-9\._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!\$&\'\(\)\*\+,;=:@])|[\/\?])*)?$/iu',$str_url);
 }

Here is the test for it:

$urls=array('http://www.example.org/','http://www.example.org:80/','example.org','ftp://user:pass@example.org/','http://example.org/?cat=5&test=joo','http://www.fi/?cat=5&test=joo','http://[::1]/','http://[2620:0:1cfe:face:b00c::3]/','http://[2620:0:1cfe:face:b00c::3]:80/','');
foreach ($urls as $a)
{
    echo $a."\n";
    $a=is_url($a);
    var_dump($a);
}

And that outputs:

"http://www.example.org/" bool(true)
"http://www.example.org:80/" bool(true)
"example.org" bool(false)
"ftp://user:pass@example.org/" bool(true)
"http://example.org/?cat=5&test=joo" bool(true)
"http://www.fi/?cat=5&test=joo" bool(true) 
"http://[::1]/" bool(true)
"http://[2620:0:1cfe:face:b00c::3]/" bool(true)
"http://[2620:0:1cfe:face:b00c::3]:80/" bool(true)
"" bool(false)

So what is an RFC-compliant, working regexp?

Yennefer
jmto
  • +1 for a useful question. In my opinion you're acting correctly by asking a new question, since it does differ from the other one (which is obviously for IPv4 only). You'll get more rep soon if you post more here at SO, and the low-rep barriers are part of what keeps the quality high here. ;-) – Lucero Jan 17 '11 at 12:41
  • You are looking for absolute URIs, aren’t you? Because even an empty string is a valid URI reference. – Gumbo Jan 17 '11 at 12:42
  • Your IPv6 example is not correct, it should be `http://[2620:0:1cfe:face:b00c::3]:80/` so parsers can differentiate between the hex delimiters and the optional :80 port number. – mario Jan 17 '11 at 12:51
  • OK, I'll comment here. Yes, my mistake in the IPv6 syntax. Those work, so the real problem is example.org: being accepted as valid. All those URLs break the editing here, and I get some OAuth stuff if I try to edit my post, so I'm not going to edit it. Oh, that OAuth stuff is already there... – jmto Jan 17 '11 at 13:46
  • Well, OK, looking at the RFC it seems that the scheme is allowed to have dots in it. So basically "example.org:" is valid according to the RFC. – jmto Jan 17 '11 at 14:28

4 Answers

4

Well, if you look at it, the specification is broken down into "chunks". I'd suggest building the regex the same way, so that it's easier to read, maintain and understand. The parts of the regex are (optional ones are marked):

  1. Scheme
  2. Username/Password (optional)
  3. Domain Or IP Address
  4. Port (optional)
  5. Path (optional)
  6. Query (optional)
  7. Anchor (optional)

So, we need to build a regex sub-part for each.

  1. Scheme:

    $scheme = "[a-z][a-z0-9+.-]*";
    
  2. Username/Password:

    $username = "([^:@/]+(:[^:@/]+)?@)?"; // optional "user@" or "user:password@"
    
  3. Domain or IP Address:

    Now, we need to build up the 3 possible hosts:

    1. Domain Name
    2. IPv4
    3. IPv6

    Domain Name:

    $segment = "([a-z]([a-z0-9-]*[a-z0-9])?)"; // one label: starts with a letter, never ends with "-"
    $domain = "({$segment}\.)*{$segment}";
    

    IPv4:

    $segment = "(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"; // one octet, 0-255
    $ipv4 = "({$segment}\.{$segment}\.{$segment}\.{$segment})";
    

    IPv6:

    $block = "([a-f0-9]{0,4})";
    $rawIpv6 = "({$block}:){2,8}";
    $ipv4sub = "(::ffff:{$ipv4})";
    $ipv6 = "([({$rawIpv6}|{$ipv4sub})])";
    

    Finally:

    $host = "($domain|$ipv4|$ipv6)";
    
  4. Port:

    $port = "(:[\d]{1,5})?";
    
  5. Path:

    $path = "([^?;\#]*)?";
    
  6. Query:

    $query = "(\?[^\#;]*)?";
    
  7. Anchor:

    $anchor = "(\#.*)?";
    

And the final regex:

$regex = "#^{$scheme}://{$username}{$host}{$port}(/{$path}{$query}{$anchor}|)$#i";

Note that the / is in the regex, and not the path part since path can be empty.

Also note that I have not tested this. It should work, but definitely it needs confirming that each part is correct (as for what to expect in the url).
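To make the confirming step concrete, here is a minimal, self-contained sketch of how the parts could be assembled and checked against the URLs from the question. It is not the answer's exact code: the function name `build_url_regex` is hypothetical, the delimiter is `~` instead of `#`, and the IPv6 part is deliberately loose (anything made of hex digits, colons and dots inside brackets), so a stricter check of the bracket contents would still be needed.

```php
<?php
// Hypothetical helper: assembles a simplified URL pattern from named parts.
// This is a sketch, not a full RFC 3986 validator.
function build_url_regex(): string
{
    $scheme   = '[a-z][a-z0-9+.-]*';
    $userinfo = '(?:[^:@/]+(?::[^:@/]*)?@)?';       // optional user[:password]@

    $label  = '[a-z](?:[a-z0-9-]*[a-z0-9])?';       // one domain label
    $domain = '(?:' . $label . '\.)*' . $label;

    $octet = '(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'; // 0-255
    $ipv4  = $octet . '(?:\.' . $octet . '){3}';

    // Assumption: loose placeholder, only checks the bracketed shape.
    $ipv6 = '\[[0-9a-f:.]+\]';

    $host     = '(?:' . $domain . '|' . $ipv4 . '|' . $ipv6 . ')';
    $port     = '(?::[0-9]{1,5})?';
    $path     = '[^?#]*';
    $query    = '(?:\?[^#]*)?';
    $fragment = '(?:#.*)?';

    // As in the answer, the "/" lives here because the path may be empty.
    return '~^' . $scheme . '://' . $userinfo . $host . $port
         . '(?:/' . $path . $query . $fragment . ')?$~iD';
}

$urls = array(
    'http://www.example.org/',
    'ftp://user:pass@example.org/',
    'http://[2620:0:1cfe:face:b00c::3]:80/',
    'example.org',
    'example.org:',
);
foreach ($urls as $url) {
    echo $url, ' => ';
    var_dump(preg_match(build_url_regex(), $url) === 1);
}
```

Because this pattern requires `://` after the scheme, it rejects both `example.org` and `example.org:`, which is stricter than the RFC grammar.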

Also also note that this is only one way of doing it. You could use other tools that don't need regexp or a library or framework that'll be easier to maintain in the long run.

Best of luck

ircmaxell
  • At a quick look, it seems this will fail on many IPv6 addresses. IPv6 validation can't be that simple. How about http://[::1]/ or http://[2620:0:1cfe:face:b00c::3]/? It's quite normal to write an IPv6 address in shortened form. – jmto Jan 17 '11 at 14:21
3

After reading RFC 3986, I have to say I was wrong. That regexp is fully working (as far as I know). My first mistake was the syntax of IPv6 addresses: they have to be enclosed in []. The second was about example.org: (note the trailing colon). But as the RFC says, a scheme can have dots in it, so that is also valid.

So that is a valid, RFC-compliant way to do it, but people will usually (as I will) need to modify it to accept only certain schemes.
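One way to accept only certain schemes, sketched under the assumption that a separate check next to the big regexp is acceptable. The function name `has_allowed_scheme` and the default list of schemes are hypothetical, not part of the original code.

```php
<?php
// Hypothetical helper: true only if the string starts with one of the
// allowed schemes followed by ":".
function has_allowed_scheme(string $url, array $allowed = array('http', 'https', 'ftp')): bool
{
    // RFC 3986 scheme: a letter, then letters, digits, "+", "-" or ".".
    if (!preg_match('/^([a-z][a-z0-9+.-]*):/i', $url, $m)) {
        return false; // no scheme at all, e.g. "example.org"
    }
    // Schemes are case-insensitive, so compare in lower case.
    return in_array(strtolower($m[1]), $allowed, true);
}

var_dump(has_allowed_scheme('http://example.org/')); // bool(true)
var_dump(has_allowed_scheme('example.org:'));        // bool(false)
```

Here `example.org:` is rejected because its "scheme" would be `example.org`, which is not in the list.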

jmto
0

Thanks ircmaxell, but I had to adjust the IPv6 regex a little for PHP to compile it with preg_match.

I changed:

$ipv6 = "([({$rawIpv6}|{$ipv4sub})])";

To:

$ipv6 = "({$rawIpv6}|{$ipv4sub})";
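Dropping the brackets makes the pattern compile, but it also stops requiring the [ ] that URLs use around IPv6 literals. A possible alternative, assuming the compile error comes from the unescaped `[` being read as the start of a character class, is to escape the brackets. The sketch below repeats the IPv6 sub-patterns so that it runs on its own; `$ipv4` is a simplified stand-in, not the pattern from the answer.

```php
<?php
$block   = "([a-f0-9]{0,4})";
$rawIpv6 = "({$block}:){2,8}";
$ipv4    = "([0-9]{1,3}(\.[0-9]{1,3}){3})"; // simplified stand-in (assumption)
$ipv4sub = "(::ffff:{$ipv4})";

// Escaped brackets: matched literally instead of opening a character class.
$ipv6 = "(\[({$rawIpv6}|{$ipv4sub})\])";

var_dump(preg_match("#^{$ipv6}$#i", '[::ffff:192.0.2.1]')); // int(1)
var_dump(preg_match("#^{$ipv6}$#i", '::'));                 // int(0), brackets are required
var_dump(preg_match("#^{$ipv6}$#i", '[::1]'));              // int(0)
```

The last result shows that escaping only fixes the compile problem: the simple `$rawIpv6` pattern still rejects shortened addresses such as `[::1]`.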
bensiu
0

Here's RFC that you can study: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax. Section 3.2.2 Host is what you're looking for.

Unfortunately PHP's built-in function filter_var() doesn't support the IPv6 syntax:

<?php

var_dump(filter_var('http://[2620:0:1cfe:face:b00c::3]:80/', FILTER_VALIDATE_URL));
// Output: boolean false
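The bare address can still be checked with filter_var(), using FILTER_VALIDATE_IP together with FILTER_FLAG_IPV6. The following is a rough sketch of that idea; the helper name `has_valid_ipv6_host` is hypothetical, and for brevity it assumes the URL has no user:password@ part before the host.

```php
<?php
// Hypothetical helper: true if the URL's host is a bracketed, valid IPv6 literal.
// Assumption: no userinfo ("user:pass@") precedes the host.
function has_valid_ipv6_host(string $url): bool
{
    // Capture whatever sits between "scheme://[" and the closing "]".
    if (!preg_match('~^[a-z][a-z0-9+.-]*://\[([^\]]+)\]~i', $url, $m)) {
        return false;
    }
    // Let PHP validate the address itself, without the brackets.
    return filter_var($m[1], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(has_valid_ipv6_host('http://[2620:0:1cfe:face:b00c::3]:80/')); // bool(true)
var_dump(has_valid_ipv6_host('http://[not:an:address]/'));              // bool(false)
```

This only covers the host; the rest of the URL would still need its own check.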
Community
Crozin
  • Well, I'm not looking for a way to validate URLs in PHP, but to replace URLs with a href links. Oh, but knowing that IPv6 is supposed to be in [] will help. Well then, that regexp works with those, but fails on the 'example.org:' syntax. – jmto Jan 17 '11 at 13:40
  • I agree, example.org isn't valid, and neither is example.org: (note the trailing colon). But the regexp in the question says it is valid: example.org bool(false) example.org: bool(true) – jmto Jan 17 '11 at 14:19
  • I looked at how PHP does that. http://svn.php.net/viewvc/php/php-src/trunk/ext/filter/logical_filters.c?view=markup - /* Use parse_url - if it returns false, we return NULL */ http://pl.php.net/manual/en/function.parse-url.php - "This function is not meant to validate the given URL, it only breaks it up into the above listed parts. Partial URLs are also accepted, parse_url() tries its best to parse them correctly." Funny PHP: it has a filter for VALIDATE_URL, but it uses its own function which is not meant to validate URLs! :D – jmto Jan 18 '11 at 09:52
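A small illustration of the parse_url() behaviour quoted in the last comment: it splits a string into parts and does not judge whether the result is a usable URL. The variable names are only for this example.

```php
<?php
// A complete URL is broken into its components.
$parts = parse_url('http://www.example.org:80/?cat=5&test=joo');
var_dump($parts['scheme'], $parts['host'], $parts['port'], $parts['path'], $parts['query']);

// A partial URL is accepted too: everything ends up in "path".
$partial = parse_url('example.org');
var_dump($partial);
```

So a non-false return value from parse_url() says nothing about whether the input is an absolute URL.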