
I'm teaching myself some basic scraping, and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.

So I need a test at the top of the code to check if the URL returns 404 or not.

This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.

One blog recommended I use this:

$valid = @fsockopen($url, 80, $errno, $errstr, 30);

and then test to see if $valid is empty or not.

But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.

I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.

Suggestions? And what's this about curl?

bignose

15 Answers

299

If you are using PHP's curl bindings, you can check the error code using curl_getinfo as follows:

$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);

/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if($httpCode == 404) {
    /* Handle 404 here. */
}

curl_close($handle);

/* Handle $response here. */
strager
  • I'm not familiar with cURL yet, so I'm missing a few concepts. What do I do with the $response variable down below? What does it contain? –  Jan 03 '09 at 01:09
  • @bflora, I made a mistake in the code. (Will fix in a second.) You can see the documentation for curl_exec on PHP's site. – strager Jan 03 '09 at 01:24
  • @bflora $response will contain the content of the $url, so you can do additional things like checking the content for specific strings or whatever. In your case, you just care about the 404 state, so you probably do not need to worry about $response. – Beau Simensen Jan 03 '09 at 01:42
  • Interesting. Right now I'm using $html = new DOMDocument(); @$html->loadHTMLFile($url); $xml = simplexml_import_dom($html); to get the contents of the URLs and step through them to get the elements I need to pull in. Would curl be better? –  Jan 03 '09 at 02:17
  • @bflora, If you send a request to the server, it will process your request and return an HTTP code along with the data. If you request twice, your script is about twice as slow (I/O is usually the slowest part). If you use the data you received on the first request, it'd be faster. – strager Jan 03 '09 at 02:24
  • @bflora, Also, there's an option in PHP which disallows you to fopen() a URL (and DOMDocument probably uses fopen() in loadHTMLFile()). curl is superior, and it allows for much more configurability (e.g. you can ask for the response to be compressed, or in another language). – strager Jan 03 '09 at 02:25
  • What if you just want the headers to load instead of downloading the whole file? – patrick Mar 11 '14 at 22:28
  • @patrick then you need to specify `curl_setopt($handle, CURLOPT_NOBODY, true);` before running `curl_exec` – user Nov 28 '14 at 03:39
  • Can I get a real-time example? – Gem Mar 10 '18 at 12:01
  • What about a redirect, a 302 code, to a 404? Where is CURLOPT_FOLLOWLOCATION? – dima.rus Jul 28 '19 at 07:56
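
Following up on the last comment: a sketch (not part of the accepted answer) that enables CURLOPT_FOLLOWLOCATION, so a 302 that eventually lands on a 404 is reported as 404. The helper name `final_http_code()` is made up for illustration:

```php
<?php
/* Sketch: follow redirect chains so the reported status code is that of
   the final destination, not the intermediate 301/302 hop.
   final_http_code() is a hypothetical helper name. */
function final_http_code($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 hops
    curl_setopt($handle, CURLOPT_MAXREDIRS, 5);         // guard against redirect loops
    curl_setopt($handle, CURLOPT_NOBODY, true);         // headers only, skip the body
    curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);  // code of the last hop
    curl_close($handle);
    return $code;                                       // 0 if the request failed entirely
}
```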
110

If you're running PHP 5 you can use:

$url = 'http://www.example.com';
print_r(get_headers($url, 1));

Alternatively, for PHP 4 a user has contributed the following:

/**
This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52. This version tries to emulate get_headers() function at PHP4. I think it works fairly well, and is simple. It is not the best emulation available, but it works.

Features:
- supports (and requires) full URLs.
- supports changing of default port in URL.
- stops downloading from socket as soon as end-of-headers is detected.

Limitations:
- only gets the root URL (see line with "GET / HTTP/1.1").
- doesn't support HTTPS (nor the default HTTPS port).
*/

if(!function_exists('get_headers'))
{
    function get_headers($url,$format=0)
    {
        $url=parse_url($url);
        $end = "\r\n\r\n";
        $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
        if ($fp)
        {
            $out  = "GET / HTTP/1.1\r\n";
            $out .= "Host: ".$url['host']."\r\n";
            $out .= "Connection: Close\r\n\r\n";
            $var  = '';
            fwrite($fp, $out);
            while (!feof($fp))
            {
                $var.=fgets($fp, 1280);
                if(strpos($var,$end))
                    break;
            }
            fclose($fp);

            $var=preg_replace("/\r\n\r\n.*\$/s",'',$var); /* the /s flag lets . match across the body's newlines */
            $var=explode("\r\n",$var);
            if($format)
            {
                foreach($var as $i)
                {
                    if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                        $v[$parts[1]]=$parts[2];
                }
                return $v;
            }
            else
                return $var;
        }
    }
}

Both would have a result similar to:

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    [ETag] => "3f80f-1b6-3e1cb03b"
    [Accept-Ranges] => bytes
    [Content-Length] => 438
    [Connection] => close
    [Content-Type] => text/html
)

Therefore you could just check to see that the header response was OK, e.g.:

$headers = get_headers($url, 1);
if ($headers[0] == 'HTTP/1.1 200 OK') {
//valid 
}

if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
//moved or redirect page
}

W3C Codes and Definitions

Muhammad Reda
Asciant
  • I made a few formatting improvements to your answer, and I also added in the ability for https: `get_headers($https_url,1,443);` I am sure it will work, though it is not in the standard `get_headers()` function. Feel free to test it and respond with a status for it. – JamesM-SiteGen Feb 06 '11 at 05:07
  • Nice workaround for php4, but for cases like this we have the HEAD http method. – vidstige Jan 16 '13 at 21:16
  • So this would actually be faster than the curl method? – FLY Feb 15 '13 at 08:37
  • This solution is not valid when the target URL redirects to a 404. In that case $headers[0] will be a redirect code, and the final 404 code will be appended somewhere later in the returned array. – roomcays Oct 17 '13 at 16:33
  • This ends up being more trouble than it's worth in PHP to filter out the actual code from the resultant string, when trying to simply deal with the status code in a script, as opposed to echoing out the result for reading. – Kzqai Jun 10 '16 at 18:51
  • @Kzqai that's hardly any trouble at all; calling an entire app to do that work is just silly if you don't have other uses for curl going on. – That Realty Programmer Guy May 01 '18 at 20:11
  • `get_headers($url, 1)[0] === "HTTP/1.1 200 OK"` was just what I wanted! Thank you so much! – RedGuy11 Feb 20 '21 at 21:33
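
As noted in the comments, when the URL redirects, get_headers() returns the status line of every hop, so checking $headers[0] sees only the first one. A sketch (my own helper, not from the answer) that keeps the final status code:

```php
<?php
/* Sketch: get_headers() in non-associative mode returns one entry per
   header line, including a "HTTP/x.x NNN ..." status line for each
   redirect hop. final_status_code() is a hypothetical helper name;
   the last status line seen wins. */
function final_status_code(array $header_lines) {
    $code = 0;
    foreach ($header_lines as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // keep overwriting so the final hop's code remains
        }
    }
    return $code;
}

// Usage (requires network access):
// $code = final_status_code(get_headers($url));
```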
40

With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, it would redirect to a 404 page, which, as I said before, may not have a 404 code.

function is_404($url) {
    $handle = curl_init($url);
    curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);

    /* Return true unless the document loaded successfully (2xx) without any redirection or error. */
    return !($httpCode >= 200 && $httpCode < 300);
}
Aram Kocharyan
24

As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).
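
A minimal sketch of that suggestion (the URL is a placeholder; adapt the accepted answer's check to it):

```php
<?php
/* Sketch of a HEAD-style check: CURLOPT_NOBODY skips downloading the
   body, so only the headers come back. http://example.com/ is a
   placeholder URL. */
$handle = curl_init('http://example.com/');
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_NOBODY, true); // ask for headers only
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);

if ($httpCode == 404) {
    /* Handle 404 here. */
}
```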

Beau Simensen
  • +1 for mentioning me^W^Wproviding a more efficient alternative, in the case where only the header needs to be checked. =] – strager Jan 03 '09 at 01:04
16

If you are looking for the easiest solution, and one you can try in one go on PHP 5, do:

file_get_contents('http://www.yoursite.com'); // the scheme is required for the HTTP wrapper
//and check by echoing
echo $http_response_header[0];
Nasaralla
8

This function returns the status code of a URL in PHP without downloading the body (the HTML code), by using a HEAD request:

function isHttpStatusCode200(string $url): bool
{
    return getHttpResponseCode($url) === 200;
}

function getHttpResponseCode(string $url): int
{
    $context = stream_context_create(
        array(
            'http' => array(
                'method' => 'HEAD'
            )
        )
    );
    $headers = get_headers($url, false, $context);
    return (int) substr($headers[0], 9, 3);
}

Example:

echo isHttpStatusCode200('https://www.google.com');
//displays: 1 (echoing true prints "1")
Sebastian Viereck
  • This definitely must be higher! – Martin Aug 04 '22 at 14:05
  • The stream_context_set_default will change any future HTTP requests and possibly break them. It would be best to use $alternate = stream_context_create() and use it in get_headers($url, false, $alternate) – 3c71 May 02 '23 at 08:46
7

I found this answer here:

if (($twitter_XML_raw = file_get_contents($timeline)) === false) {
    // Retrieve HTTP status code
    list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

    // Check the HTTP Status code
    switch($status_code) {
        case 200:
                $error_status="200: Success";
                break;
        case 401:
                $error_status="401: Login failure.  Try logging out and back in.  Password are ONLY used when posting.";
                break;
        case 400:
                $error_status="400: Invalid request.  You may have exceeded your rate limit.";
                break;
        case 404:
                $error_status="404: Not found.  This shouldn't happen.  Please let me know what happened using the feedback link above.";
                break;
        case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
        case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
        case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
        default:
                $error_status="Undocumented error: " . $status_code;
                break;
    }
}

Essentially, you use file_get_contents() to retrieve the URL, which automatically populates the $http_response_header variable; its first element contains the status line.

Ross
7

This will give you true if the URL does not return 200 OK:

function check_404($url) {
    $headers = get_headers($url, 1);
    return $headers[0] != 'HTTP/1.1 200 OK';
}
Juergen Schulze
6

Addendum: I tested those three methods with performance in mind.

The result, at least in my testing environment:

Curl wins

This test was done under the assumption that only the headers (no body) are needed. Test yourself:

$url = "http://de.wikipedia.org/wiki/Pinocchio";

$start_time = microtime(TRUE);
$headers = get_headers($url);
echo $headers[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";


$start_time = microtime(TRUE);
$response = file_get_contents($url);
echo $http_response_header[0]."<br>";
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";

$start_time = microtime(TRUE);
$handle = curl_init($url);
curl_setopt($handle,  CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_NOBODY, 1); // and *only* get the header 
/* Get the HTML or whatever is linked in $url. */
$response = curl_exec($handle);
/* Check for 404 (file not found). */
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
// if($httpCode == 404) {
    // /* Handle 404 here. */
// }
echo $httpCode."<br>";
curl_close($handle);
$end_time = microtime(TRUE);
echo $end_time - $start_time."<br>";
Email
4

Here is a short solution.

$handle = curl_init($uri);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($handle, CURLOPT_HTTPHEADER, array("Accept: application/rdf+xml"));
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
if ($httpCode == 200 || $httpCode == 303) {
    echo "you might get a reply";
}
curl_close($handle);

In your case, you can change application/rdf+xml to whatever you use.

Panda
Andreas
2

<?php

$url= 'www.something.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);   
curl_setopt($ch, CURLOPT_NOBODY, true);    
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);


echo $httpcode;
?>
2

As an additional hint to the great accepted answer:

When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So what I did was the following:

set_time_limit(120);
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
set_time_limit(ini_get('max_execution_time'));
curl_close($curl);

First I set the time limit to a higher number of seconds; at the end, I set it back to the value defined in the PHP settings.

markus
  • Hmmm... besides, your code consumes fewer resources because you are not returning the content... still, if you set return transfer to false, that can save a lot of resources when people make multiple calls... beginners don't think much about this, and that's the reason for 40 upvotes... that's fine... – Jayapal Chandran Mar 07 '12 at 14:26
1

You can use this code too, to see the status of any link:

<?php

function get_url_status($url, $timeout = 10)
{
    $ch = curl_init();
    // set cURL options
    $opts = array(
        CURLOPT_RETURNTRANSFER => true, // do not output to browser
        CURLOPT_URL            => $url, // set URL
        CURLOPT_NOBODY         => true, // do a HEAD request only
        CURLOPT_TIMEOUT        => $timeout // set timeout
    );
    curl_setopt_array($ch, $opts);
    curl_exec($ch); // do it!
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // find HTTP status
    curl_close($ch); // close handle
    echo $status; // or return $status;
    // example checking
    if ($status == '302') { echo 'HEY, redirection'; }
}

get_url_status('http://yourpage.comm');
?>
T.Todua
1

Here's a way!

<?php

$url = "http://www.google.com";

if (@file_get_contents($url)) {
    echo "Url Exists!";
} else {
    echo "Url Doesn't Exist!";
}

?>

This simple script makes a request to the URL for its source code. If the request completes successfully, it will output "Url Exists!". If not, it will output "Url Doesn't Exist!".

0

This is just a slice of code; hope it works for you.

function get_http_code($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $response = curl_exec($ch);
    $errno    = curl_errno($ch);
    $error    = curl_error($ch);

    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info['http_code'];
}