Running the following code:

var_dump(get_headers("http://www.domainnnnnnnnnnnnnnnnnnnnnnnnnnnn.com/CraxyFile.jpg"));

returns HTTP 200 instead of 404 for any domain or URL that does not exist:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Server: nginx/1.1.15
    [2] => Date: Mon, 08 Oct 2012 12:29:13 GMT
    [3] => Content-Type: text/html; charset=utf-8
    [4] => Connection: close
    [5] => Set-Cookie: PHPSESSID=3iucojet7bt2peub72rgo0iu21; path=/; HttpOnly
    [6] => Expires: Thu, 19 Nov 1981 08:52:00 GMT
    [7] => Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    [8] => Pragma: no-cache
    [9] => Set-Cookie: bypassStaticCache=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; httponly
    [10] => Set-Cookie: bypassStaticCache=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/; httponly
    [11] => Vary: Accept
)

If you run:

var_dump(get_headers("http://www.domain.com/CraxyFile.jpg"));

you get:

Array
(
    [0] => HTTP/1.1 404 Not Found
    [1] => Date: Mon, 08 Oct 2012 12:32:18 GMT
    [2] => Content-Type: text/html
    [3] => Content-Length: 8727
    [4] => Connection: close
    [5] => Server: Apache
    [6] => Vary: Accept-Encoding
)

There are many instances where get_headers() has been suggested as a way to check that a URL exists.

Is this a bug, or is get_headers() simply not a reliable way to validate a URL?

See Live Demo: http://codepad.viper-7.com/TAIPWk

UPDATE 1

I found out that cURL has the same issue:

$curl = curl_init();
curl_setopt_array($curl, array(CURLOPT_RETURNTRANSFER => true,CURLOPT_URL => 'idontexist.tld'));
curl_exec($curl);
$info = curl_getinfo($curl);
curl_close($curl);
var_dump($info);

This also returns the same result.

Baba
  • 5
    At a guess you are behind a transparent proxy that is serving its own error pages with a 200 response code. Probably something like OpenDNS. I suspect you will find that all the domains causing this are resolving to the same IP. – DaveRandom Oct 08 '12 at 12:33
  • perhaps `http://www.domainnnnnnnnnnnnnnnnnnnnnnnnnnnn.com/CraxyFile.jpg` exists but `http://www.domain.com/CraxyFile.jpg` doesn't? – SDC Oct 08 '12 at 12:33
  • @SDC: He says it's a recurring problem with all very long domain names. – Madara's Ghost Oct 08 '12 at 12:35
  • Also please note that one question mark per sentence is sufficient. – DaveRandom Oct 08 '12 at 12:35
  • See Live Demo http://codepad.viper-7.com/TAIPWk – Baba Oct 08 '12 at 12:35
  • how long does a domain have to be in order to exhibit this behaviour? do you know what the cut-off length is? – SDC Oct 08 '12 at 12:36
  • I get "Warning: get_headers(): php_network_getaddresses: getaddrinfo failed: No such host is known. in D:\Websites\htdocs\tests\index.php on line 3" when running at my local devenv. – Madara's Ghost Oct 08 '12 at 12:36
  • @SDC I'm not sure. I tested over 100 different URLs, both locally and on a remote server, e.g. `var_dump(get_headers("http://www.45645454354353453453454.com/CraxyFile.jpg"));` and it still gives HTTP 200 – Baba Oct 08 '12 at 12:38
  • 2
    @Baba http://codepad.viper-7.com/s2YnY0 - note how both of the non-existent domains resolve to the same IP. Like I say, you are using a DNS service that resolves non-existent domains to some server that gives you a "friendly" error page with a 200 response code. This is very annoying behaviour I admit, but the solution is not to use those services. If you want a generic internet DNS service that does not do this, I recommend Google's open servers `8.8.8.8` and `8.8.4.4` – DaveRandom Oct 08 '12 at 12:41
  • +1 @DaveRandom ... Indirectly, does it mean `get_headers` is not reliable? Please add your comment as an answer; people need to know this – Baba Oct 08 '12 at 12:42
  • @Baba - what I mean is, DaveRandom's suggestion sounds plausible. I wonder if there's a specific cut-off length that's causing the error? Or possibly it's just that the shorter domains you've tried do actually exist whereas the longer ones don't? – SDC Oct 08 '12 at 12:43
  • 1
    @SDC It's nothing to do with length, it is simply whether the name exists. Consider the first domain of [this](http://codepad.viper-7.com/peEWSF) example. – DaveRandom Oct 08 '12 at 12:46
  • @DaveRandom I totally agree, valid point ... is there a workaround? – Baba Oct 08 '12 at 12:50
  • @Baba just writing an answer now – DaveRandom Oct 08 '12 at 12:54
  • @DaveRandom ... cool expecting .... – Baba Oct 08 '12 at 13:00
  • 1
    I really think people should check the meaning of `too localized` – Baba Oct 09 '12 at 13:18
  • 1
    why on earth is this topic closed?? It's a very valid problem and not localized at all! I'm running into the same problem at the moment... – patrick May 27 '14 at 13:21
  • 1
    @patrick because of some clowns who don't know the meaning of localized – Baba May 27 '14 at 15:31

1 Answer

The problem has nothing to do with the length of the domain name; it is simply whether the domain exists.

You are using a DNS service that resolves non-existent domains to a server that gives you a "friendly" error page, which it returns with a 200 response code. This means it is not a problem with get_headers() specifically; it affects any procedure with an underlying reliance on sensible DNS lookups.
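
One way to check whether a resolver behaves like this is to look up names that should never exist and see whether an address comes back. The following is only a minimal sketch: `resolvesNonexistentNames()` and its `$resolver` parameter are hypothetical names introduced for illustration, and the sketch relies on gethostbyname() returning the unmodified host name when a lookup fails.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether the resolver hands
// back an address for names that should not exist.
// $resolver is an assumption added so the logic can be exercised without
// network access; it defaults to PHP's gethostbyname().
function resolvesNonexistentNames(array $bogusHosts, ?callable $resolver = null): bool
{
    $resolver = $resolver ?? 'gethostbyname';

    foreach ($bogusHosts as $host) {
        // gethostbyname() returns the unmodified host name on failure, so any
        // other return value means the resolver produced an address.
        if ($resolver($host) !== $host) {
            return true;
        }
    }

    return false;
}
```

Calling `resolvesNonexistentNames(['idontexist.tld', 'alsonotreal.invalid'])` on the affected machine would be expected to return `true` if the DNS service is rewriting failed lookups; the second name is just an example using the reserved `.invalid` TLD.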

A way to handle this without hardcoding a workaround for every environment you work in might look something like this:

// A domain that definitely does not exist. The easiest way to guarantee that
// this continues to work is to use an illegal top-level domain (TLD) suffix
$testDomain = 'idontexist.tld';

// If this resolves to an IP, we know that we are behind a service such as this
// We can simply compare the actual domain we test with the result of this
$badIP = gethostbyname($testDomain);

// Then when you want to get_headers()
$url = 'http://www.domainnnnnnnnnnnnnnnnnnnnnnnnnnnn.com/CraxyFile.jpg';

$host = parse_url($url, PHP_URL_HOST);
if (gethostbyname($host) === $badIP) {
  // The domain does not exist - probably handle this as if it were a 404
} else {
  // do the actual get_headers() stuff here
}

You may want to somehow cache the return value of the first call to gethostbyname(), since you know you are looking up a name that does not exist, and this can often take a few seconds.
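
As a rough sketch of that caching idea, the lookup of the known-bad name can be done once per object and reused. `HostChecker`, `hostExists()` and the injectable `$resolver` are hypothetical names introduced here, not part of PHP. The sketch also treats a plainly failed lookup as "does not exist", which the snippet above does not do; that part relies on gethostbyname() returning the unmodified host name on failure.

```php
<?php
// Hypothetical class for illustration only.
final class HostChecker
{
    private $resolver;
    private $badIp = false; // false means "not looked up yet"

    // $resolver is an assumption added for testability; defaults to gethostbyname().
    public function __construct(?callable $resolver = null)
    {
        $this->resolver = $resolver ?? 'gethostbyname';
    }

    // Looks up the known-bad name once and remembers the result.
    // Returns null when the resolver does not rewrite failed lookups.
    private function badIp(): ?string
    {
        if ($this->badIp === false) {
            $testDomain = 'idontexist.tld';
            $ip = ($this->resolver)($testDomain);
            $this->badIp = ($ip === $testDomain) ? null : $ip;
        }
        return $this->badIp;
    }

    public function hostExists(string $url): bool
    {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            return false; // no usable host in the URL
        }

        $ip = ($this->resolver)($host);

        // An unchanged return value means the lookup failed, unless the host
        // was already an IP address literal.
        if ($ip === $host && filter_var($host, FILTER_VALIDATE_IP) === false) {
            return false;
        }

        // Same address as the known-bad name: treat as non-existent.
        return $ip !== $this->badIp();
    }
}
```

With this shape, get_headers() would only be called when `hostExists($url)` returns `true`. The cache here lives only for the lifetime of the object; persisting it across requests is left open.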

DaveRandom
  • +1 Nice ... I'll ask again: does this prove that `get_headers` is not reliable? – Baba Oct 08 '12 at 13:08
  • 3
    @Baba It's not `get_headers()` specifically, it is really any function that performs a task over the network based on a name instead of an IP address. But in a nutshell, no it is not reliable - for other reasons as well, because it relies on the server handling `HEAD` requests in the same way as it handles `GET` requests, which in many ways is not a safe assumption (even though it should be according to the standards). – DaveRandom Oct 08 '12 at 13:10
  • Thanks ... I'm definitely not going mad – Baba Oct 08 '12 at 13:16
  • 1
    `get_headers` does not perform a HEAD request by default, but a GET request. http://php.net/get_headers https://bugs.php.net/bug.php?id=55716 – hakre Dec 13 '12 at 05:56