I am checking a URL and returning "valid" if its status code is "200" and "invalid" if it is "404".

The URLs are links that redirect to a certain page (URL), and I need to check that page's status code to determine whether it is valid or invalid.

<?php

// From URL to get redirected URL
$url = 'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625';
  
// Initialize a CURL session.
$ch = curl_init();
  
// Grab URL and pass it to the variable.
curl_setopt($ch, CURLOPT_URL, $url);
  
// Catch output (do NOT print!)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  
// Return follow location true
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$html = curl_exec($ch);
  
// Getinfo or redirected URL from effective URL
$redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  
// Close handle
curl_close($ch);
echo "Original URL:   " . $url . "<br/><br/>";
echo "Redirected URL: " . $redirectedUrl . "<br/>";

 function is_url_valid($url) {
  $handle = curl_init($url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_NOBODY, true);
  curl_exec($handle);
 
  $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
  curl_close($handle);
 
  if ($httpCode == 200) {
    return 'valid link';
  }
  else {
    return 'invalid link';
  }
}

// 
echo "<br/>".is_url_valid($redirectedUrl)."<br/>";

As you can see, the above link has status 400, yet the script still reports "valid". I am using the code above. Any thoughts or corrections to make it work as expected? It seems the site has more than one redirected URL and the script checks only one of them, which is why it shows valid. Any thoughts on how to fix it?

Here are the links I am checking with:

ISSUE -

FOR EXAMPLE - If I check this link, https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518, the browser ends up on a "404" page, but the script output is "200".

  • The above link has status code 302 and redirects to a new URL that has status code 200; I want to check the final URL (the last one in the chain). –  Jul 03 '21 at 05:18
  • `$httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));` - just to be safe make sure it's an integer for your comparison – Kinglish Jul 03 '21 at 05:20
  • Thanks for the comment and suggestion, though I am getting 404 as the status code in the output. –  Jul 03 '21 at 05:22
  • Does https://stackoverflow.com/questions/2280394/how-can-i-check-if-a-url-exists-via-php help? – Nigel Ren Jul 03 '21 at 06:17
  • No, it does not; I have already checked that. –  Jul 03 '21 at 06:29
  • The issue is that I am not getting the final URL from a URL that has multiple redirections. If I can get the final URL, I can check it with the approach from the link you provided, and that would work. –  Jul 03 '21 at 06:30
  • Any thoughts on this? –  Jul 05 '21 at 06:47
  • @devhs - I am not sure whether this is a proper solution. But I checked some of the above links, and they serve a custom page for 404. As a quick solution, you can get the contents of the URL with "file_get_contents" and check the page title. – Sachin Vairagi Jul 05 '21 at 09:26
  • @SachinVairagi Thanks for the suggestion, but I need to get the final URL first; then I can use "file_get_contents" on it. Once I have the final URL, there are a few ways to determine its status. –  Jul 05 '21 at 09:45
  • You'll be left with some cases where you can't find the response code of the final URL because of the `Refresh` header. In the past I had a similar requirement, but to fetch the og tags from the final URL, and I ended up leaving some corner cases unhandled. – Haridarshan Jul 05 '21 at 10:47
  • Yes, I think so. Can you elaborate on "because of Refresh header"? –  Jul 05 '21 at 10:51
  • @devhs: the meta refresh (http-equiv) header. It is not within the headers of the HTTP response message but within the body, if the body is hypertext. That is, when you go there with a browser and have automatic redirects enabled, you are redirected. And honestly, I cannot imagine that your question has not already been asked and answered multiple times on SO. If I remember correctly, one reference is https://hakre.wordpress.com/2011/09/17/head-first-with-php-streams/ and another is https://stackoverflow.com/q/981954/367456 . But your main problem is that you are not using a browser but a different client (curl); search for that, there are many answers. – hakre Jul 05 '21 at 12:26
  • "then in browser it goes on "404" but in script o/p its "200"": maybe that's because the host has inclusion protection enabled, from cPanel or a custom PHP script (like mine), to avoid server leaks and/or file inclusion attacks. –  Jul 08 '21 at 14:48
  • By "Refresh" header, I mean `header("Refresh:5; url=page2.php");`. In this case `curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);` doesn't follow the redirection. The others are the **meta refresh http-equiv header** and JavaScript redirects. – Haridarshan Jul 09 '21 at 13:02
  • I am not really sure which answer to award the bounty to; let Stack Overflow decide. Thanks. –  Jul 13 '21 at 04:47
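
The `Refresh` header and meta refresh cases mentioned in the comments can at least be detected from PHP once the response headers and body are available. The following is a minimal sketch, not a complete solution: the function names `parseRefreshTarget` and `findRefreshRedirect` are hypothetical, the regular expressions only cover the common forms (for the meta tag, `http-equiv` is assumed to appear before `content`), and JavaScript redirects are not handled at all.

```php
<?php

/**
 * Extract the target URL from a Refresh value such as "5; url=page2.php".
 * Returns null when the value carries no URL (a plain page reload).
 */
function parseRefreshTarget(string $value): ?string
{
    if (!preg_match('/^\s*\d+(?:\.\d+)?\s*[;,]\s*(?:url\s*=\s*)?(.+?)\s*$/i', $value, $m)) {
        return null;
    }
    // Strip optional quotes around the URL.
    return trim($m[1], "\"'");
}

/**
 * Look for a Refresh target first in the raw response header lines,
 * then in a <meta http-equiv="refresh"> tag inside the body.
 */
function findRefreshRedirect(array $headerLines, string $body): ?string
{
    foreach ($headerLines as $line) {
        if (preg_match('/^Refresh:\s*(.+)$/i', $line, $m)) {
            return parseRefreshTarget($m[1]);
        }
    }
    // Assumption: the http-equiv attribute precedes the content attribute.
    $meta = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*content\s*=\s*(["\'])(.*?)\1/is';
    if (preg_match($meta, $body, $m)) {
        return parseRefreshTarget(html_entity_decode($m[2]));
    }
    return null;
}
```

A returned target may be relative (as in `page2.php`), so it would still have to be resolved against the page's own URL before being requested.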

5 Answers

2

I use the get_headers() function for this. If I find a 2xx status in the array, then the URL is OK.

function urlExists($url){
  $headers = @get_headers($url);
  if($headers === false) return false;
  return preg_grep('~^HTTP/\d+\.\d+\s+2\d{2}~',$headers) ? true : false;
}
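
By default get_headers() follows HTTP redirects and returns the headers of every response in the chain in one array, so the status line of the last response is the one that describes the final page. As a hedged variation on the function above, the sketch below reads that last status code instead of searching for any 2xx line; the helper name `lastStatusCode` is hypothetical and it only inspects an array that has already been fetched.

```php
<?php

/**
 * Return the status code of the last response found in a get_headers()
 * result, or null if the array contains no HTTP status line.
 */
function lastStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Each response in a redirect chain contributes its own status line;
        // keep overwriting so that the last one wins.
        if (is_string($line) && preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}
```

It could be used as `lastStatusCode(get_headers($url) ?: [])`, which yields null when the request fails. Like the original function, it only sees HTTP-level redirects.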
jspit
  • Thanks for the answer, but what if the main URL has redirections (multiple redirections)? Take this URL, for example: https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518 –  Jul 05 '21 at 09:58
  • The function returns true for this URL. Is that ok? – jspit Jul 05 '21 at 10:04
  • No, it's not: the page's status code is 404 (not found), so it should not return true. –  Jul 05 '21 at 10:06
  • I don't get a 404 status. I get 0 => "HTTP/1.1 302 Found" with var_export(get_headers($url)); – jspit Jul 05 '21 at 10:14
  • Just visit https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518 in a browser; the page is redirected to a page that returns 404 Not Found, and its status code in the network tab is also 404. –  Jul 05 '21 at 10:15
  • I don't get an ad if JavaScript is deactivated in my browser. I think this forwarding is done via JavaScript. This problem cannot be solved with PHP alone. – jspit Jul 05 '21 at 11:00
  • That's absolutely right. Is there any quick solution to implement this? What do you say? –  Jul 05 '21 at 11:06
  • I don't have a quick fix. – jspit Jul 05 '21 at 11:33
  • Well, thanks for the answer and suggestion. –  Jul 05 '21 at 11:35
1

This is my take on this issue. Basically, the takeaways are:

  1. You don't need to make more than one request. CURLOPT_FOLLOWLOCATION does the whole job for you: when there are one or more redirections, the HTTP response code you get at the end is the one from the final request.
  2. Since you are using CURLOPT_NOBODY, the request uses the HEAD method and returns no body. For that reason, CURLOPT_RETURNTRANSFER is useless.
  3. I have taken the liberty of using my own coding style (no offence).
  4. Since I was running the code from a PhpStorm scratch file, I have added some PHP_EOL as line breaks to format the output. Feel free to remove them.

<?php

$linksToCheck = [
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=547531.5112&type=15&murl=https%3A%2F%2Fwww.peopletree.co.uk%2Fwomen%2Fdresses%2Fanna-checked-dress',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.2335&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fagnetha-black-floral-print-bamboo-dress-midnight-navy%2F%2392%3D1390%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.752&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fbernice-floral-tunic-dress%2F%2392%3D1273%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.6863&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fjosefa-smock-shift-dress-in-midnight-navy-hemp%2F%2392%3D1390%26142%3D208',
    'https://www.shareasale.com/m-pr.cfm?merchantID=16570&userID=1860618&productID=546729471',
    'https://www.shareasale.com/m-pr.cfm?merchantID=53661&userID=1860618&productID=680698793',
    'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518',
    'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625',
];

function isValidUrl($url) {
    echo "Original URL:   " . $url . "<br/>\n";

    $handle = curl_init($url);

    // Follow any redirection.
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);

    // Use a HEAD request and do not return a body.
    curl_setopt($handle, CURLOPT_NOBODY, true);

    // Execute the request.
    curl_exec($handle);

    // Get the effective URL.
    $effectiveUrl = curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
    echo "Effective URL:   " . $effectiveUrl . "<br/><br/>";

    $httpResponseCode = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    // Close this request.
    curl_close($handle);

    if ($httpResponseCode == 200) {
        return '✅';
    }
    else {
        return '❌';
    }
}

foreach ($linksToCheck as $linkToCheck) {
    echo PHP_EOL . "Result: " . isValidUrl($linkToCheck) . PHP_EOL . PHP_EOL;
}
asiby
  • Haha, cool use of UTF-8! Unfortunately the OP wants to follow ***javascript*** redirects as well; see my answer below for info :( – hanshenrik Jul 06 '21 at 14:37
1

Note: We have used CURLOPT_NOBODY to just check for the connection and not to fetch the whole body.

$url = "Your URL";
$curl = curl_init($url);
// HEAD request: check the connection without fetching the body.
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);

if ($result !== false) {
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
        echo "URL does not exist";
    } else {
        echo "URL exists";
    }
} else {
    // The request itself failed (DNS error, timeout, ...).
    echo "URL does not exist";
}
curl_close($curl);
0

The code below works well, but when I put the URLs in an array and test the same functionality, it does not give proper results. Any thoughts why? Also, feel free to update the answer to make it dynamic, i.e. able to check multiple URLs at once when an array of URLs is provided.

  <?php
    
    // URL to check
    $url = 'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518';
      
    $ch = curl_init(); // Initialize a CURL session.
    curl_setopt($ch, CURLOPT_URL, $url); // Grab URL and pass it to the variable.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Catch output (do NOT print!)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Return follow location true
    $html = curl_exec($ch);
    $redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Getinfo or redirected URL from effective URL
    curl_close($ch); // Close handle
    
    $get_final_url = get_final_url($redirectedUrl);
    if($get_final_url){
        echo is_url_valid($get_final_url);
    }else{
        echo $redirectedUrl ? is_url_valid($redirectedUrl) : is_url_valid($url);
    }
    
    function is_url_valid($url) {
      $handle = curl_init($url);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($handle, CURLOPT_NOBODY, true);
      curl_exec($handle);
     
      $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
      curl_close($handle);
      echo $httpCode;
      if ($httpCode == 200) {
        return '<b> Valid link </b>';
      }
      else {
        return '<b> Invalid link </b>';
      }
    }
    
    function get_final_url($url) {
        $ch = curl_init();
        if (!$ch) {
            return false;
        }
        curl_setopt($ch, CURLOPT_URL,            $url);
        curl_setopt($ch, CURLOPT_HEADER,         1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT,        30);
        $ret  = curl_exec($ch);
        $info = curl_getinfo($ch);
        curl_close($ch);

        // Bail out if the request failed or returned no status code.
        if (empty($ret) || empty($info['http_code'])) {
            return false;
        }
        // Look for a JavaScript redirect target such as replace('https:\/\/...').
        if (!preg_match('#(https:.*?)\'\)#', $ret, $match)) {
            return false;
        }
        // Remove the JavaScript escaping and any stray whitespace.
        return trim(stripslashes($match[1]));
    }
  • Just an idea: requests from your script arrive with a pattern that the host detects, and it then counteracts your intentions. Or, as you would perhaps word it: why does that host undermine my expectations? It's their server; you can only send requests, and you have to live with the answer (response) ;) – hakre Jul 05 '21 at 12:27
0

See, the problem here is that you want to follow JavaScript redirects. The URL you're complaining about, https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518, does redirect to a URL that responds with HTTP 200 OK, and that page contains this JavaScript:

<script LANGUAGE="JavaScript1.2">
                window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
                </script>

So your browser, which understands JavaScript, follows the JavaScript redirect, and that JS redirect leads to a 404 page. Unfortunately there is no good way to do this from PHP; your best bet would probably be a headless web browser, e.g. PhantomJS, Puppeteer, Selenium or something like that.

Still, you can kind of hack in a regex search for a JavaScript redirect and hope for the best, e.g.:

<?php
function is_url_valid(string $url):bool{
    if(0!==strncasecmp($url,"http",strlen("http"))){
        // file:///etc/passwd and stuff like that aren't considered valid urls right?
        return false;
    }
    $ch=curl_init();
    if(!curl_setopt_array($ch,array(
        CURLOPT_URL=>$url,
        CURLOPT_FOLLOWLOCATION=>1,
        CURLOPT_RETURNTRANSFER=>1
    ))){
        // best guess: the url is so malformed that even CURLOPT_URL didn't accept it.
        return false;
    }
    $resp= curl_exec($ch);
    if(false===$resp){
        return false;
    }
    if(curl_getinfo($ch,CURLINFO_RESPONSE_CODE) != 200){
        // only HTTP 200 OK is accepted
        return false;
    }
    // attempt to detect javascript redirects... sigh
    // window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
    $rex = '/location\.replace\s*\(\s*(?<redirect>(?:\'|\")[\s\S]*?(?:\'|\"))/';
    if(!preg_match($rex, $resp, $matches)){
        // no javascript redirects detected..
        return true;
    }else{
        // javascript redirect detected..
        $url = trim($matches["redirect"]);
        // javascript allows both ' and " for strings, but json only allows " for strings
        $url = str_replace("'",'"',$url);
        $url = json_decode($url, true,512,JSON_THROW_ON_ERROR); // we extracted it from javascript, need json decoding.. (well, strictly speaking, it needs javascript decoding, but json decoding is probably sufficient, and we only have a json decoder nearby)
        curl_close($ch);
        return is_url_valid($url);
    }
}
var_dump(

    is_url_valid('https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518'),
    is_url_valid('http://example.org'),
    is_url_valid('http://example12k34jr43r5ehjegeesfmwefdc.org'),
    
);

But that's a dodgy, hacky solution, to put it mildly.

hanshenrik
  • Thanks for the answer. Let me check whether it works with multiple URLs at once, e.g. if I create an array of the URLs posted in the question and call "is_url_valid" in a loop. –  Jul 05 '21 at 12:11
  • @devhs shouldn't be a problem. BTW, I just noticed this approach has another significant weakness: it doesn't handle infinite redirects. E.g. if page1 redirects to page2, which redirects to page1, which redirects to page2..., this script will just follow the redirects forever, until PHP's max_execution_time is reached or until the call stack is exhausted. (It's possible to fix, though.) – hanshenrik Jul 05 '21 at 12:33
  • Thanks, I will check it. I have just tested it here - https://paiza.io/projects/N3m4E11HZAmq5uTb8gLjcg but it does not seem to be working. –  Jul 05 '21 at 12:35
  • @devhs that URL returns bool(true) for me if I make sure to change it from `paiza.io` to `https://paiza.io`. What do you get? – hanshenrik Jul 05 '21 at 12:44
  • The link is an online compiler where I have tested your code: paiza.io/projects/N3m4E11HZAmq5uTb8gLjcg –  Jul 05 '21 at 12:45
  • @devhs hmm, actually I tested it: the paiza.io timeout limit is 2000 milliseconds, but checking that first URL takes about 1956 milliseconds on my laptop. I guess their system is a bit slower, and that's just enough to hit the timeout limit :P – hanshenrik Jul 05 '21 at 13:02
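
The infinite-redirect weakness raised in the comments can be avoided by tracking the URLs already visited and capping the number of hops. The sketch below is one possible shape for that guard, not the answer author's fix: `resolveFinalUrl` and its `$nextUrl` callback are hypothetical names, and the callback stands in for whatever detects a redirect (for example the regex search from the answer), returning the next URL or null when the page does not redirect.

```php
<?php

/**
 * Follow a chain of redirects with a hop limit and loop detection.
 *
 * $nextUrl is a hypothetical callback: given a URL, it returns the URL that
 * page redirects to, or null when the page does not redirect.
 * Returns the final URL, or null if a loop or the hop limit is hit.
 */
function resolveFinalUrl(string $url, callable $nextUrl, int $maxHops = 10): ?string
{
    $seen = [];
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (isset($seen[$url])) {
            return null; // loop detected: page1 -> page2 -> page1 ...
        }
        $seen[$url] = true;

        $next = $nextUrl($url);
        if ($next === null) {
            return $url; // no further redirect: this is the final URL
        }
        $url = $next;
    }
    return null; // more than $maxHops redirects
}
```

Because the function is iterative rather than recursive, a long chain cannot exhaust the call stack, and the hop limit bounds the total run time independently of max_execution_time.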