
I'm teaching myself some basic scraping, and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

bignose

15 Answers

299

If you are using PHP's curl bindings, you can check the error code using curl_getinfo as follows:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */
strager
  • I'm not familiar with cURL yet, so I'm missing a few concepts. What do I do with the $response variable down below? What does it contain? –  Jan 03 '09 at 01:09
  • @bflora, I made a mistake in the code. (Will fix in a second.) You can see the documentation for curl_exec on PHP's site. – strager Jan 03 '09 at 01:24
  • @bflora $response will contain the content of the $url, so you can do additional things like checking the content for specific strings or whatever. In your case, you just care about the 404 state, so you probably do not need to worry about $response. – Beau Simensen Jan 03 '09 at 01:42
  • Interesting. Right now I'm using $html = new DOMDocument(); @$html->loadHTMLFile($url); $xml = simplexml_import_dom($html); to get the contents of the URLs and step through them to get the elements I need to pull in. Would curl be better? –  Jan 03 '09 at 02:17
  • @bflora, If you send a request to the server, it will process your request and return an HTTP code along with the data. If you request twice, your script is about twice as slow (I/O is usually the slowest part). If you use the data you received on the first request, it'd be faster. – strager Jan 03 '09 at 02:24
  • @bflora, Also, there's an option in PHP which disallows you to fopen() a URL (and DOMDocument probably uses fopen() in loadHTMLFile()). curl is superior, and it allows for much more configurability (e.g. you can ask for the response to be compressed, or in another language). – strager Jan 03 '09 at 02:25
  • What if you just want the headers to load instead of downloading the whole file? – patrick Mar 11 '14 at 22:28
  • @patrick then you need to specify `curl_setopt($handle, CURLOPT_NOBODY, true);` before running `curl_exec` – user Nov 28 '14 at 03:39
  • Can I get a real-time example? – Gem Mar 10 '18 at 12:01
  • What about a redirect, a 302 code, to a 404? Where is CURLOPT_FOLLOWLOCATION? – dima.rus Jul 28 '19 at 07:56
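
Following up on the last comment: a sketch (not part of the accepted answer) that enables CURLOPT_FOLLOWLOCATION, so a 302 that eventually lands on a 404 is reported as 404. The helper name `final_http_code()` is made up for illustration:

```php
<?php
/* Sketch: follow redirect chains so the reported status code is that of
   the final destination, not the intermediate 301/302 hop.
   final_http_code() is a hypothetical helper name. */
function final_http_code($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 hops
    curl_setopt($handle, CURLOPT_MAXREDIRS, 5);         // guard against redirect loops
    curl_setopt($handle, CURLOPT_NOBODY, true);         // headers only, skip the body
    curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);  // code of the last hop
    curl_close($handle);
    return $code;                                       // 0 if the request failed entirely
}
```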
110

If you're running PHP 5 you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP 4 a user has contributed the following:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate get_headers() function at PHP4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- doesn't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*\$/s",'',$var); /* the /s flag lets . match across the body's newlines */
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would have a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

Therefore you could just check to see that the header response was OK, e.g.:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
//valid 
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
//moved or redirect page
}

W3C Codes and Definitions

Muhammad Reda
Asciant
  • I made a few formatting improvements to your answer, and I also added in the ability for https: `get_headers($https_url,1,443);` I am sure it will work, though it is not in the standard `get_headers()` function. Feel free to test it and respond with a status for it. – JamesM-SiteGen Feb 06 '11 at 05:07
  • Nice workaround for php4, but for cases like this we have the HEAD http method. – vidstige Jan 16 '13 at 21:16
  • So this would actually be faster than the curl method? – FLY Feb 15 '13 at 08:37
  • This solution is not valid when the target URL redirects to a 404. In that case $headers[0] will be a redirect code, and the final 404 code will be appended somewhere later in the returned array. – roomcays Oct 17 '13 at 16:33
  • This ends up being more trouble than it's worth in PHP to filter out the actual code from the resultant string, when trying to simply deal with the status code in a script, as opposed to echoing out the result for reading. – Kzqai Jun 10 '16 at 18:51
  • @Kzqai that's hardly any trouble at all; calling an entire app to do that work is just silly if you don't have other uses for curl going on. – That Realty Programmer Guy May 01 '18 at 20:11
  • `get_headers($url, 1)[0] === "HTTP/1.1 200 OK"` was just what I wanted! Thank you so much! – RedGuy11 Feb 20 '21 at 21:33
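
As noted in the comments, when the URL redirects, get_headers() returns the status line of every hop, so checking $headers[0] sees only the first one. A sketch (my own helper, not from the answer) that keeps the final status code:

```php
<?php
/* Sketch: get_headers() in non-associative mode returns one entry per
   header line, including a "HTTP/x.x NNN ..." status line for each
   redirect hop. final_status_code() is a hypothetical helper name;
   the last status line seen wins. */
function final_status_code(array $header_lines) {
    $code = 0;
    foreach ($header_lines as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // keep overwriting so the final hop's code remains
        }
    }
    return $code;
}

// Usage (requires network access):
// $code = final_status_code(get_headers($url));
```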
40

With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, it would redirect to a 404 page, which, as I said before, may not have a 404 code.

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* Return true unless the document loaded successfully (2xx) without any redirection or error. */
    return !($httpCode >= 200 && $httpCode < 300);
}
Aram Kocharyan
24

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).
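
A minimal sketch of that suggestion (the URL is a placeholder; adapt the accepted answer's check to it):

```php
<?php
/* Sketch of a HEAD-style check: CURLOPT_NOBODY skips downloading the
   body, so only the headers come back. http://example.com/ is a
   placeholder URL. */
$handle = curl_init('http://example.com/');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_NOBODY, true); // ask for headers only
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);

if ($httpCode == 404) {
    /* Handle 404 here. */
}
```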

Beau Simensen
  • +1 for mentioning me^W^Wproviding a more efficient alternative, in the case where only the header needs to be checked. =] – strager Jan 03 '09 at 01:04
16

If you are looking for the easiest solution, and one you can try in one go on PHP 5, do:

file_get_contents('http://www.yoursite.com'); // the scheme is required for the HTTP wrapper
//and check by echoing
echo $http_response_header[0];
Nasaralla
8

This function returns the status code of a URL in PHP without downloading the body (the HTML code), by using a HEAD request:

function isHttpStatusCode200(string $url): bool
{
    return getHttpResponseCode($url) === 200;
}

function getHttpResponseCode(string $url): int
{
    $context = stream_context_create(
        array(
            'http' => array(
                'method' => 'HEAD'
            )
        )
    );
    $headers = get_headers($url, false, $context);
    return (int) substr($headers[0], 9, 3);
}

Example:

echo isHttpStatusCode200('https://www.google.com');
//displays: 1 (echoing true prints "1")
Sebastian Viereck
  • This definitely must be higher! – Martin Aug 04 '22 at 14:05
  • The stream_context_set_default will change any future HTTP requests and possibly break them. It would be best to use $alternate = stream_context_create() and use it in get_headers($url, false, $alternate) – 3c71 May 02 '23 at 08:46
7

I found this answer here:

if (($twitter_XML_raw = file_get_contents($timeline)) === false) {
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }
}

Essentially, you use file_get_contents() to retrieve the URL, which automatically populates the $http_response_header variable; its first element contains the status line.

Ross
7

This will give you true if the URL does not return 200 OK:

function check_404($url) {
    $headers = get_headers($url, 1);
    return $headers[0] != 'HTTP/1.1 200 OK';
}
Juergen Schulze
6

Addendum: I tested those three methods with performance in mind.

The result, at least in my testing environment:

Curl wins

This test was done under the assumption that only the headers (no body) are needed. Test yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
Email
4

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_HTTPHEADER, array("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if ($httpCode == 200 || $httpCode == 303) {
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever you use.

Panda
Andreas
2

<?php

$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);   
curl_setopt($ch, CURLOPT_NOBODY, true);    
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>
2

As an additional hint to the great accepted answer:

When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So what I did was the following:

set_time_limit(120);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
set_time_limit(ini_get('max_execution_time'));
curl_close($curl);

First I set the time limit to a higher number of seconds; at the end, I set it back to the value defined in the PHP settings.

markus
  • Hmmm... besides, your code consumes fewer resources because you are not returning the content... still, if you set return transfer to false, that can save a lot of resources when people make multiple calls... beginners don't think much about this, and that's the reason for 40 upvotes... that's fine... – Jayapal Chandran Mar 07 '12 at 14:26
1

You can use this code too, to see the status of any link:

<?php

function get_url_status($url, $timeout = 10)
{
    $ch = curl_init();
    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true, // do not output to browser
        CURLOPT_URL            => $url, // set URL
        CURLOPT_NOBODY         => true, // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout // set timeout
    );
    curl_setopt_array($ch, $opts);
    curl_exec($ch); // do it!
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
    curl_close($ch); // close handle
    echo $status; // or return $status;
    // example checking
    if ($status == '302') { echo 'HEY, redirection'; }
}

get_url_status('http://yourpage.comm');
?>
T.Todua
1

Here's a way!

<?php

$url = "http://www.google.com";

if (@file_get_contents($url)) {
    echo "Url Exists!";
} else {
    echo "Url Doesn't Exist!";
}

?>

This simple script makes a request to the URL for its source code. If the request completes successfully, it will output "Url Exists!". If not, it will output "Url Doesn't Exist!".

0

This is just a slice of code; hope it works for you.

function get_http_code($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $response = curl_exec($ch);
    $errno    = curl_errno($ch);
    $error    = curl_error($ch);

    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info['http_code'];
}