
I need to check whether a string is a valid URL, allowing wildcards in the name. My URLs are already sanitized (lowercased, path removed so that example.com/path becomes example.com, and http:// or https:// added), for example:

http://example.com

http://www.example.com

http://*.example.com

These URLs are VALID. URLs like the following, on the other hand, should be marked as invalid:

http://example.*.com

http://example

http://*.it
(and obviously any otherwise invalid URL, e.g. one containing invalid characters.)

Can anyone help? I've tried many regexes, but none of them works.

My pattern should be something like: [http://].[*|a-z|www].[a-z].[tld] (also allowing fourth-level domains!)

Thanks in advance!

Wouter J
Carmine Giangregorio
  • You may want to look here: http://stackoverflow.com/a/4694816/181151 I don't think it includes wildcard checks, but you should be able to add that. – Chris Rasco Dec 10 '13 at 19:14
  • http://stackoverflow.com/questions/3653884/php-regex-for-url-validation-filter-var-is-too-permisive?rq=1 – scrowler Dec 10 '13 at 19:18
  • You may want to think about what constitutes a valid domain name. e.g. `uz.` is valid and resolves to an IP address. – Rob Farr Dec 10 '13 at 19:19
  • @ChrisRasco I think that solution is the best, but how can I allow only a single wildcard, and never in the second-level or top-level domain? @RobFarr I know that uz. is a valid URL, but I only need to validate the patterns I described: second- and third-level domains, with a possible wildcard on the third level. – Carmine Giangregorio Dec 10 '13 at 20:52
  • @CarmineGiangregorio The wildcard check might be best done in code, actually. After regex validation (using a regex that allows wildcards), use a library or function to parse the URL, then check to make sure that any wildcard present is only in the first part of the domain name. – Wayne Conrad Mar 07 '14 at 16:19

1 Answer


Regexes are tricky, but here is what I came up with:

<?php
function is_valid_domain_name($domain_name, &$matches)
{
    return preg_match('/^(\*\.)?([a-z\d](-*[a-z\d])*)(\.([a-z\d](-*[a-z\d])*))*$/i', $domain_name, $matches) // valid chars check, optional leading "*."
        && preg_match('/^.{1,253}$/', $domain_name) // overall length check
        && preg_match('/^[^\.]{1,63}(\.[^\.]{1,63})*$/', $domain_name); // length of each label
}

I ran it through the following test code:

<?php
$domains = Array('a',0,'a.b','localhost','google.com','*.example.com','news.google.co.uk','xn--fsqu00a.xn--0zwm56d','goo google.com','google..com','google.com ','google-.com','.google.com');

echo "/^(\*\.)?([a-z\d](-*[a-z\d])*)(\.([a-z\d](-*[a-z\d])*))*$/i";
foreach($domains as $domain)
{
    echo $domain . ' - ';
    echo is_valid_domain_name($domain,$matches) ? "VALID\n" : "NOT VALID\n";
    print_r($matches);
}

Here was my output:

/^(\*\.)?([a-z\d](-*[a-z\d])*)(\.([a-z\d](-*[a-z\d])*))*$/ia - VALID
Array
(
    [0] => a
    [1] => 
    [2] => a
)
0 - VALID
Array
(
    [0] => 0
    [1] => 
    [2] => 0
)
a.b - VALID
Array
(
    [0] => a.b
    [1] => 
    [2] => a
    [3] => 
    [4] => .b
    [5] => b
)
localhost - VALID
Array
(
    [0] => localhost
    [1] => 
    [2] => localhost
    [3] => t
)
google.com - VALID
Array
(
    [0] => google.com
    [1] => 
    [2] => google
    [3] => e
    [4] => .com
    [5] => com
    [6] => m
)
*.example.com - VALID
Array
(
    [0] => *.example.com
    [1] => *.
    [2] => example
    [3] => e
    [4] => .com
    [5] => com
    [6] => m
)
news.google.co.uk - VALID
Array
(
    [0] => news.google.co.uk
    [1] => 
    [2] => news
    [3] => s
    [4] => .uk
    [5] => uk
    [6] => k
)
xn--fsqu00a.xn--0zwm56d - VALID
Array
(
    [0] => xn--fsqu00a.xn--0zwm56d
    [1] => 
    [2] => xn--fsqu00a
    [3] => a
    [4] => .xn--0zwm56d
    [5] => xn--0zwm56d
    [6] => d
)
goo google.com - NOT VALID
Array
(
)
google..com - NOT VALID
Array
(
)
google.com  - NOT VALID
Array
(
)
google-.com - NOT VALID
Array
(
)
.google.com - NOT VALID
Array
(
)

I included the optional $matches parameter to preg_match so I could see which part of the string each capture group matched.
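One thing worth noting when reading that output: a capture group that repeats keeps only its last repetition, which is why news.google.co.uk shows .uk but not .google or .co. A minimal sketch of the effect, using a simplified pattern of my own rather than the one above, so treat it as an illustration only:

```php
<?php
// Simplified, illustrative pattern: one label, then zero or more ".label" groups.
preg_match('/^([a-z]+)(\.([a-z]+))*$/', 'news.google.co.uk', $m);

// Group 1 is the first label; groups 2 and 3 were overwritten on every
// repetition, so only the final ".uk" / "uk" survive.
echo $m[1], "\n"; // news
echo $m[2], "\n"; // .uk
echo $m[3], "\n"; // uk

// To get every label, split on the dot after validation has passed.
$labels = explode('.', 'news.google.co.uk');
print_r($labels); // news, google, co, uk
```

So the regex is best used for the yes/no answer, and the individual labels are easier to get with explode().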

Your final code will probably be:

<?php
function is_valid_domain_name($domain_name)
{
    return preg_match('/^(\*\.)?([a-z\d](-*[a-z\d])*)(\.([a-z\d](-*[a-z\d])*))*$/i', $domain_name) // valid chars check, optional leading "*."
        && preg_match('/^.{1,253}$/', $domain_name) // overall length check
        && preg_match('/^[^\.]{1,63}(\.[^\.]{1,63})*$/', $domain_name); // length of each label
}

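UPDATE: Making bare TLDs (and any other single-label name) NOT valid. The only change is that the last group is now followed by + instead of *, so at least one dot-separated label must follow the first one: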

<?php
function is_valid_domain_name($domain_name)
{
    return preg_match('/^(\*\.)?([a-z\d](-*[a-z\d])*)(\.([a-z\d](-*[a-z\d])*))+$/i', $domain_name) // valid chars check, at least two labels
        && preg_match('/^.{1,253}$/', $domain_name) // overall length check
        && preg_match('/^[^\.]{1,63}(\.[^\.]{1,63})*$/', $domain_name); // length of each label
}
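
The function above checks a bare domain name, while the examples in the question still carry the http:// or https:// prefix. Here is a rough sketch of a wrapper, assuming the URL has already been sanitized the way the question describes (no path, no port). The name is_valid_wildcard_url is hypothetical, and the D modifier is my addition so that $ cannot match before a trailing newline:

```php
<?php
// Hypothetical helper, not part of the original answer: validates a
// sanitized URL such as "http://*.example.com".
function is_valid_wildcard_url(string $url): bool
{
    // Require an http:// or https:// scheme and capture the host after it.
    // Assumption: the path was already removed, so no "/" may follow.
    if (!preg_match('#^https?://([^/]+)\z#i', $url, $m)) {
        return false;
    }
    $host = $m[1];

    // Same three checks as the updated function: valid characters with an
    // optional leading "*." and at least two labels, overall length,
    // and length of each label.
    return preg_match('/^(\*\.)?[a-z\d](-*[a-z\d])*(\.[a-z\d](-*[a-z\d])*)+$/iD', $host) === 1
        && strlen($host) <= 253
        && preg_match('/^[^.]{1,63}(\.[^.]{1,63})*$/D', $host) === 1;
}

// The examples from the question.
$urls = array(
    'http://example.com', 'http://www.example.com', 'http://*.example.com',
    'http://example.*.com', 'http://example', 'http://*.it',
);
foreach ($urls as $url) {
    echo $url . ' - ' . (is_valid_wildcard_url($url) ? "VALID\n" : "NOT VALID\n");
}
```

With this sketch the first three URLs come out VALID and the last three NOT VALID, which matches what the question asks for. A string without the scheme, such as example.com, is rejected as well.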

DISCLAIMER: REGEX is super tricky so you use this at your own risk. :)

Chris Rasco