I have a catalogue of sites, and I want to exclude those that return a 404 or 403 status code (that is, sites that can't show anything interesting to users). But file_get_contents or PHP's cURL functions, even with request headers set, sometimes get a 404 or 403 response when I can see a normal page in a browser. What can I use to collect the proper codes (to be sure that a site really has no content)?

Pavel Kodentsev
  • It seems like you're asking the wrong question. A 404 or 403 [will make `file_get_contents()` return `false`](http://stackoverflow.com/questions/4358130/file-get-contents-when-url-doesnt-exist). The real question is: why does a request through `file_get_contents()` give a 403 or 404 when a browser works? Probably because the site recognizes you're scraping, or because you're missing certain cookies or other request variables. – CodeCaster Jan 14 '15 at 13:45
  • I can just try file_get_contents() to check something, because it gives a warning with the response code, which one can see. Sometimes, as far as I can see, a site gives a 404 response but still shows some content. So how can I detect sites that will definitely return nothing to users? – Pavel Kodentsev Jan 14 '15 at 14:15
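
As a rough illustration of the point raised in these comments, the status code can be read without relying on the warning. This is only a sketch: `lastStatusCode()` and `fetchWithStatus()` are hypothetical helper names, the user agent string is made up, and it assumes the `http` stream wrapper is enabled. The `ignore_errors` context option asks PHP to return the body even on 4xx/5xx, and `$http_response_header` is the array PHP fills in after the request.

```php
<?php
// Hypothetical helper: pull the final status code out of the header lines
// PHP collects for an HTTP stream request.
function lastStatusCode(array $headers): int {
    $code = 0;
    foreach ($headers as $line) {
        // Each hop in a redirect chain adds its own "HTTP/x.y NNN" status line,
        // so the last match is the status of the page finally served.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical wrapper: fetch a URL without treating 4xx/5xx as a failure.
function fetchWithStatus(string $url): array {
    $context = stream_context_create(['http' => [
        'ignore_errors' => true,  // still return the body on 4xx/5xx
        'timeout'       => 5,
        'user_agent'    => 'Mozilla/5.0 (compatible; CatalogueChecker/1.0)', // made-up UA
    ]]);
    $body = @file_get_contents($url, false, $context);
    // PHP populates $http_response_header in the calling scope.
    $headers = $http_response_header ?? [];
    return [lastStatusCode($headers), $body === false ? '' : $body];
}
```

With this, `[$code, $body] = fetchWithStatus($url);` yields both the code and whatever content came back, so a 404 that still carries a page can be told apart from one that carries nothing. A code of 0 means no HTTP response was received at all.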

1 Answer

Try this function:

    <?php
    // Returns true when the URL answers with a status code from 200 up to (not including) 308.
    function Visit($url) {
        $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_VERBOSE, false);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        // Certificate checks are switched off: this only tests reachability.
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
        // Let cURL negotiate the TLS version; forcing SSLv3 fails on most servers.
        $page = curl_exec($ch);
        // echo curl_error($ch);
        $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return $httpcode >= 200 && $httpcode < 308;
    }

    if (Visit("http://www.google.com")) {
        echo "Website OK" . "\n";
    } else {
        echo "Website DOWN";
    }
    ?>

Edited according to the W3 definition of status codes.

adeam
  • @PavelKodentsev try it now: it returns true if the web page responds with a status code from 200 up to (but not including) 308; only functional websites should return these codes – adeam Jan 14 '15 at 14:13
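
For the follow-up concern in the question's comments (a site that answers 404 but still shows content), one possible variation is to look at the body as well as the code. The sketch below is not the answer's method: `looksAlive()` and `probe()` are hypothetical names, the 512-byte threshold is an arbitrary assumption, and the browser-like headers are only a guess at what a picky server might want. It follows redirects so the status reported is that of the final page.

```php
<?php
// Hypothetical rule: keep a site if the final response is not an error,
// or if it is an error that still carries a sizeable body.
function looksAlive(int $code, string $body, int $minBytes = 512): bool {
    if ($code === 0) {
        return false;  // no HTTP response at all (DNS failure, timeout, TLS error)
    }
    if ($code >= 200 && $code < 400) {
        return true;
    }
    // 4xx/5xx: keep only if the page still shows something substantial.
    return strlen(trim($body)) >= $minBytes;
}

// Hypothetical probe: fetch a URL roughly the way a browser would.
function probe(string $url): array {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,  // report the status of the final page
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_ENCODING       => '',    // accept any encoding cURL supports
        CURLOPT_COOKIEFILE     => '',    // enable an in-memory cookie jar
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    $body = curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$code, $body === false ? '' : $body];
}
```

Usage would be along the lines of `[$code, $body] = probe($url); if (!looksAlive($code, $body)) { /* drop from catalogue */ }`. The size threshold will misjudge sites with large custom error pages, so it probably needs tuning against a sample of the catalogue.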