I have a catalogue of sites, and I want to exclude those that return a 404 or 403 status code (that is, sites that can't show users anything interesting). But PHP's file_get_contents or curl functions, even with request headers set, sometimes get a 404 or 403 response even though I can see a normal page in a browser. What can I use to collect the correct status codes (to be sure that a site has no content)?
It seems like you're asking the wrong question. A 404 or 403 [will make `file_get_contents()` return `false`](http://stackoverflow.com/questions/4358130/file-get-contents-when-url-doesnt-exist). The real question is: why does a request through `file_get_contents()` give a 403 or 404 when a browser works? Probably because the site recognizes you're scraping, or because you're missing certain cookies or other request variables. – CodeCaster Jan 14 '15 at 13:45
I can just try file_get_contents() to check something, because it emits a warning containing the response code, which one can see. Sometimes, as far as I can see, a site can return a 404 response but still show some content. So how can I detect sites that will definitely return nothing to users? – Pavel Kodentsev Jan 14 '15 at 14:15
1 Answer
Try this function:
<?php
// Returns true when $url answers with a status code from 200 to 308.
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    // Certificate checks are disabled: only reachability matters here, not trust.
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    // CURLOPT_SSLVERSION is left at its default; forcing SSLv3 (value 3)
    // makes the handshake fail on most current HTTPS servers.
    $page = curl_exec($ch);
    // echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpcode >= 200 && $httpcode <= 308;
}
if (Visit("http://www.google.com"))
    echo "Website OK" . "\n";
else
    echo "Website DOWN";
?>
Edited according to the W3 definition of status codes.

adeam
@PavelKodentsev try it now; it returns true if the web page responds with a status code between 200 and 308. Only functional websites should return these codes. – adeam Jan 14 '15 at 14:13
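
The two concerns raised in the comments (requests that look unlike a browser, and sites that send a 404 yet still serve content) could be combined roughly as follows. This is only a sketch under assumptions, not something posted in the thread: `fetchStatus()` and `classifyResponse()` are hypothetical helper names, the User-Agent and Accept headers are illustrative values, and the 512-byte threshold is arbitrary. Browser-like headers may still not be enough for sites that actively block scrapers.

```php
<?php
// Hypothetical helper (not from the thread): request a URL with
// browser-like headers and report the final status code and body size.
function fetchStatus(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final page
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_ENCODING       => '',   // accept every encoding curl supports
        CURLOPT_COOKIEFILE     => '',   // enable in-memory cookie handling
        // Illustrative header values; any realistic browser values would do.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
        ],
    ]);
    $body = curl_exec($ch); // false on transport errors
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response

    return [
        'code'   => $code,
        'length' => is_string($body) ? strlen($body) : 0,
    ];
}

// Hypothetical classifier. Returns 'ok' for 2xx/3xx, 'error_with_content'
// for a 4xx/5xx status that still carries a sizeable body, else 'dead'.
// $minLength is an arbitrary illustrative threshold, not a standard value.
function classifyResponse(int $code, int $length, int $minLength = 512): string
{
    if ($code >= 200 && $code < 400) {
        return 'ok';
    }
    if ($code >= 400 && $length >= $minLength) {
        return 'error_with_content';
    }
    return 'dead';
}
```

A catalogue entry could then be checked with `$r = fetchStatus($url);` followed by `classifyResponse($r['code'], $r['length'])`, excluding only the entries classified as `'dead'` and reviewing the `'error_with_content'` ones by hand.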