
This is a link to the String in a linter.

And this is the Expression itself:

(?i)\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))

I'm trying to validate almost ANY web URL with this expression.

We can see here that it passes the unit tests as expected:

[Screenshot: the expression passing the unit tests in the linter]

Yet, as I said, when I run my code it seems to ignore the validation, which has me scratching my head.

This is the relevant portion of the code:

//kindly taken from here: http://stackoverflow.com/a/34589895/2226328
function checkPageSpeed($url){    
    if (function_exists('file_get_contents')) {    
        $result = @file_get_contents($url);
    }   

    if ($result == '') {    
        $ch = curl_init();    
        $timeout = 60;    
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER,1);//get the header
        curl_setopt($ch, CURLOPT_NOBODY,1);//and *only* get the header    
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);//get the response as a string from curl_exec(), rather than echoing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);  
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);  
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_FRESH_CONNECT,1);//don't use a cached version of the url    

        $result = curl_exec($ch);    
        curl_close($ch);    
     }    
    return $result;    
}  

function pingGoogle($url){

    echo "<h1>".$url."</h1>";

    if(strtolower(substr($url, 0, 4)) !== "http") {
        echo "adding http:// to $url <br/>";
        $url = "http://".$url;
        echo "URL is now $url <br/>";
    } 

    //original idea from https://gist.github.com/dperini/729294
    $re = "/(?i)\\b((?:https?:\\/\\/|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}\\/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\\\".,<>?«»“”‘’]))/"; 

    $test = preg_match($re, $url);  
    var_export($test);

    if( $test === 1) { 
        echo "$url passes pattern Test...let's check if it's actually valid ..."; 

        pingGoogle("hjm.google.cm/");
        pingGoogle("gamefaqs.com");
    }
    else 
    { 
        echo  "URL formatted proper but isn't an active URL! <br/>"; 
    }
}
Thomas Ayoub
Frankenmint
  • Could it be the === you have instead of == ? Please see here http://stackoverflow.com/questions/1117967/what-does-mean. Try changing it to == and see what happens. – Trev Davies May 18 '16 at 13:10
  • @Remuze same result still...also I get funny formatting issues from Sublime Text...but I don't know if it is actually just the program or a real issue: http://puu.sh/oWlBI/fcef09bfc4.png – Frankenmint May 18 '16 at 13:20
  • I might be missing something here but why are there single backslashes in the screenshot and double backslashes in the PHP code? Doesn't `\\` escape a backslash? – Henders May 18 '16 at 13:26
  • not quite sure, they're not here anymore, I probably messed with it in the regex linter before pasting it back into my code – Frankenmint May 18 '16 at 13:34
  • You might want to update your code reflecting this :) – Henders May 18 '16 at 13:38
  • @Henders their code generation box gives me an escaped string we see...the last few passes I've been just pulling it from the builder page itself – Frankenmint May 18 '16 at 13:45
  • Ah, I'm with you now. Sorry, it's just if you paste the regex in your PHP code (`$re`) into regex101.com it gives me a bunch of errors like 'Unescaped delimiter' – Henders May 18 '16 at 13:48
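
For reference, here is a minimal sketch of the escaping point raised in the comments above. In a double-quoted PHP string, `\\` collapses to a single backslash before the pattern reaches the regex engine, so the doubled backslashes in the PHP source and the single backslashes in the linter describe the same pattern. The variable names and the `example.com` URL are illustrative assumptions, not taken from the original post.

```php
<?php
// A small fragment of the pattern, written two ways.
$phpSource = "https?:\\/\\/"; // as typed in PHP source (double-quoted string)
$rawRegex  = 'https?:\/\/';   // as the regex engine (or a linter) sees it

// After PHP processes the escapes, both variables hold the same characters.
var_dump($phpSource === $rawRegex);

// Print the pattern text that is actually handed to PCRE.
echo $phpSource, PHP_EOL;

// Wrapped in / delimiters, the fragment compiles and matches a URL scheme.
var_dump(preg_match('/' . $phpSource . '/', 'http://example.com'));
```

This would also explain the 'Unescaped delimiter' errors: pasting the PHP-source form (with doubled backslashes) straight into a linter makes `\\/` read as an escaped backslash followed by a bare `/`.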

2 Answers


Holy moly that's a regex and a half...

Consider using parse_url to let PHP do the processing for you. Since you're only interested in the domain name, try:

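$host = parse_url($url, PHP_URL_HOST);
// parse_url() returns null when the host component is missing
// and false when the URL is seriously malformed.
if ($host === null || $host === false) {
    echo "Failed to parse, no host found";
}
else {
    // do something with supposed host here
}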
Niet the Dark Absol
  • meh...I think I've settled on this `@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)@` to work...I don't really need to do what you suggest because I run that test anyway after parsing to see if a URL looks like a URL, then it fails if it cannot connect...by the way...THIS is a regex and then some: https://gist.githubusercontent.com/dperini/729294/raw/ad7c8d5b4834f2de4c42e24d4a09b48c202da155/regex-weburl.js – Frankenmint May 18 '16 at 14:38
  • @Frankenmint I see your "regex and then some" and raise you [the regex in RFC822](http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html) – ʰᵈˑ May 18 '16 at 17:02

Have you considered simply using PHP's built-in validation filter, FILTER_VALIDATE_URL, along with filter_var() for this? It is probably better than rolling your own regex-based solution, both in terms of simplifying your code and in terms of performance.

http://php.net/manual/en/function.filter-var.php

http://php.net/manual/en/filter.filters.validate.php
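
As a rough sketch of that suggestion, something like the following could replace the regex check. The helper name `isValidUrl` and the sample inputs are assumptions for illustration, not part of the original code. Note that FILTER_VALIDATE_URL only checks syntax and expects a scheme, so it says nothing about whether the host is actually reachable.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
function isValidUrl(string $url): bool
{
    // filter_var() returns the filtered value on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

// Sample inputs chosen for illustration only.
$candidates = ['http://gamefaqs.com', 'gamefaqs.com'];

foreach ($candidates as $candidate) {
    printf("%s => %s\n", $candidate, isValidUrl($candidate) ? 'valid' : 'invalid');
}
```

Because a bare host such as `gamefaqs.com` fails this filter, the scheme would still need to be prepended first, as pingGoogle() already does, before validating.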

Mike Brant