1

Looking at the PHP docs for get_headers()...

array get_headers ( string $url [, int $format = 0 ] )

... there are two ways to run it:

#1 (format === 0)

$headers = get_headers($url);

// or

$headers = get_headers($url, 0);

#2 (format !== 0)

$headers = get_headers($url, 1);

The difference between the two is whether the arrays are numerically indexed (first case)...

(excerpt from docs)

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Sat, 29 May 2004 12:28:13 GMT
    [2] => Server: Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    ... etc

... or indexed with keys (second case)...

(excerpt from docs)

Array
(
    [0] => HTTP/1.1 200 OK
    [Date] => Sat, 29 May 2004 12:28:14 GMT
    [Server] => Apache/1.3.27 (Unix)  (Red-Hat/Linux)
    [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
    ... etc

In the example given in the docs, the HTTP status line sits under a numerical index...

[0] => HTTP/1.1 200 OK

... regardless of what format is set to.

Similarly, for every valid URL that I have ever put through get_headers (i.e. many URLs), the status codes have always been under numerical indexes, even when multiple status codes are present...

// Output from json_encode(get_headers($url, 1))

{
    "0": "HTTP/1.1 301 Moved Permanently",
    "1": "HTTP/1.1 200 OK",
    "Date": [
        "Thu, 11 Aug 2016 07:12:28 GMT",
        "Thu, 11 Aug 2016 07:12:28 GMT"
    ],
    "Content-Type": [
        "text/html; charset=iso-8859-1",
        "text/html; charset=UTF-8"
    ]
    ... etc

But I have not tested (read: cannot test) every URL on every type of server, and so cannot speak in absolutes about the status code indexes.

Is it possible that get_headers($url, 1) could return a non-numerical http status code index? Or is it hard-coded into the function to always return the status codes under numerical indices - no matter what?


Extra reading, not necessary or essential to the question above...

For the curious, my question is mostly to do with optimization. get_headers() is already painfully slow - even when sending a HEAD request instead of GET - and only gets worse after combing through the return array with a preg_match and regex.

(The various CURL methods you'll find are even slower, I've tested them against get_headers() with very long lists of URLs, so holster that hip-shot, partner)

If I know that the status codes are always numerically indexed, then I can speed my code up a bit, by ignoring all non-integer indices, before running them through the preg_match. The difference for one URL might only be fractions of a second, but when running this function all day, every day, those little bits add up.
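The filtering idea described above could be sketched roughly as follows. The helper name status_lines() is made up for illustration, and the sketch assumes that status lines really do always land under integer keys, which is exactly what the question is asking about.

```php
<?php
// Hypothetical helper (not part of PHP): keep only the integer-keyed entries
// of a get_headers($url, 1) result, on the assumption that those are the
// status lines. Named headers such as "Date" have string keys and are dropped.
function status_lines(array $headers): array
{
    // ARRAY_FILTER_USE_KEY passes each key (not the value) to is_int().
    return array_values(array_filter($headers, 'is_int', ARRAY_FILTER_USE_KEY));
}
```

Only the entries returned by this helper would then need to go through preg_match.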

Additionally (Edit #1)

I'm currently only worried about the final http status code (and URL), after all redirects. I was using a method similar to this to get the final URL.

It seems that after running

$headers = array_reverse($headers);

then the final status code after the redirects will always be in $headers[0]. But, once again, this is only a sure thing if the status codes are numerically indexed.
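As a rough alternative to reversing the array, the last integer-keyed entry can be picked out directly. This is only a sketch: final_status_code() is a made-up name, and it rests on the same unproven assumption that every status line is stored under an integer key, in the order the responses arrived.

```php
<?php
// Hypothetical helper (not part of PHP): return the numeric status code of the
// last integer-keyed entry in a get_headers($url, 1) result, or null if none.
function final_status_code(array $headers): ?int
{
    $final = null;
    foreach ($headers as $key => $value) {
        // Assumption: integer keys hold status lines, oldest first.
        if (is_int($key) && is_string($value)) {
            $final = $value;
        }
    }
    // Accepts "HTTP/1.1 200 OK" as well as "HTTP/2 200".
    if ($final !== null && preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $final, $m)) {
        return (int) $m[1];
    }
    return null;
}
```

This runs the regex once per URL instead of once per header line.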

Birrel
  • Logically speaking, the status code is the first line in the response, and it **doesn't have a _name_**. Every other HTTP header follows the `name: value` format, only the status code line does not. So… it makes no real sense to index it any other way but numerically. What else would you index it by? – deceze Aug 11 '16 at 08:05
  • @deceze nothing at all? I'm not too worried about assigning a name to it, I'm more concerned with whether the index is *always* numerical. – Birrel Aug 11 '16 at 08:09

2 Answers

3

The PHP C source code for that function looks like this:

        if (!format) {
no_name_header:
            add_next_index_str(return_value, zend_string_copy(Z_STR_P(hdr)));
        } else {
            char c;
            char *s, *p;

            if ((p = strchr(Z_STRVAL_P(hdr), ':'))) {
                ... omitted ...
            } else {
                goto no_name_header;
            }
        }

In other words, it tests if there's a : in the header, and if so proceeds to index it by its name (omitted here). If there's no : or if you did not request to $format the result, no_name_header kicks in and it adds it to the return_value without explicit index.

So, yes, the status lines should always be numerically indexed. Unless the server puts a : into the status line, which would be unusual. Note that RFC 2616 does not explicitly prohibit the use of : in the reason phrase part of the status line:

Status-Line    = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

Reason-Phrase  = *<TEXT, excluding CR, LF>

TEXT           = <any OCTET except CTLs,
                 but including LWS>

There is no standardised reason phrase which contains a ":", but you never know, you may encounter exotic servers in the wild which defy convention here…
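To make the edge case concrete, here is a simplified PHP re-creation of the branching described above. It is not the real implementation: index_header_lines() is a made-up name, and the handling of the omitted branch (trimming the value, turning repeated headers into arrays) is an assumption inferred from the output shown in the question.

```php
<?php
// Hypothetical helper: mimics how the C code is described to index header lines.
// A sketch for illustration only, not PHP's actual implementation.
function index_header_lines(array $lines, bool $format): array
{
    $result = [];
    foreach ($lines as $line) {
        // Only look for a colon when a formatted result was requested.
        $pos = $format ? strpos($line, ':') : false;
        if ($pos === false) {
            // The "no_name_header" case: append under the next integer key.
            $result[] = $line;
            continue;
        }
        $name  = substr($line, 0, $pos);
        $value = ltrim((string) substr($line, $pos + 1));
        if (!array_key_exists($name, $result)) {
            $result[$name] = $value;
        } else {
            // Assumed behaviour: a repeated header becomes a list of values.
            $result[$name] = array_merge((array) $result[$name], [$value]);
        }
    }
    return $result;
}
```

Fed a made-up status line such as "HTTP/1.1 200 OK: fine", this sketch stores it under the string key "HTTP/1.1 200 OK" rather than under 0, which is the unusual case the answer warns about.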

deceze
  • You're the man for combing through 800+ lines of code! And reading my verbose question! And the 12 edits you've made to your answer just while I've been typing this comment! I'm sure you didn't read it line-by-line, but I still greatly appreciate the effort. I'll keep it as-is for now, and hope for the best. If servers start switching things up it'll ruin a lot of people's day, so here's to hoping they won't... – Birrel Aug 11 '16 at 08:29
  • 1
    I've just gotten good at using search tools and skimming. No way I'd read 800 lines of code… :-P – deceze Aug 11 '16 at 08:31
0

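Since the first status line is always at index 0, you could assign it to an associative key and discard the original key. Note that if redirects were followed, the later status lines remain at indices 1, 2, and so on.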

$headers = get_headers($url,1);
$headers['Http-Response'] = $headers[0];
unset($headers[0]);