
Background info:

  • I'm collecting some URLs dynamically from various sources online.
  • I would like to get the URL's content if it's an HTML page or an image.
  • I do not want to load large files (such as a ZIP download, a PDF or others) only to find out that the target is of no interest to me.

Is there a way I can check the response type/format with PHP before actually fetching the content? (The goal is to avoid wasting my own and the target server's resources and bandwidth.)

(I found get_headers() in the PHP docs, but it is unclear to me whether the function fetches the entire content and then returns the headers, or only requests the headers from the server without downloading the content first. I also found solutions that get the headers with cURL and fsockopen, but the question remains whether I can do it without loading the actual content.)

preyz
  • [Related](http://stackoverflow.com/questions/1378915/header-only-retrieval-in-php-via-curl); you can use cURL to just get the header. – Supericy Feb 13 '13 at 21:59
  • `get_headers()` sends a `GET` request by default. But see example #2 (in the manual) for issuing more lightweight `HEAD` requests. – mario Feb 13 '13 at 22:00

3 Answers


Try using an HTTP HEAD request to retrieve only the headers. Something like:

curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD');

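or (what the manual recommends, since a custom `HEAD` request on its own only changes the method string and can leave cURL waiting for a body that never arrives):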

curl_setopt($ch, CURLOPT_NOBODY, true);

(I haven't tested either of these.)
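
To make the `CURLOPT_NOBODY` variant concrete, here is a minimal sketch. The helper names `headContentType()` and `isWantedType()` are hypothetical and not part of the answer above. It assumes the cURL extension is loaded and that the server reports the same `Content-Type` for `HEAD` as it would for `GET`, which not every server does.

```php
// Hypothetical helper: send a HEAD request and return the reported Content-Type,
// or null if the request failed or the server sent no Content-Type header.
function headContentType(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD: no body is transferred
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // report the type of the final target
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // assumption: 10 s is long enough

    $ok = curl_exec($ch);
    $type = $ok === false ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return is_string($type) ? $type : null;
}

// Hypothetical helper: true for HTML pages and images, false for everything else.
function isWantedType(?string $contentType): bool
{
    if ($contentType === null) {
        return false;
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));

    return $mediaType === 'text/html' || strpos($mediaType, 'image/') === 0;
}
```

A caller would then fetch the body only when `isWantedType(headContentType($url))` is true. The price is a second request for every URL that passes the check.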

Mark Leighton Fisher

There is a PHP function for that, get_headers():

$headers=get_headers("http://www.amazingjokes.com/img/2014/530c9613d29bd_CountvonCount.jpg");
print_r($headers);

returns the following:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Tue, 11 Mar 2014 22:44:38 GMT
    [2] => Server: Apache
    [3] => Last-Modified: Tue, 25 Feb 2014 14:08:40 GMT
    [4] => ETag: "54e35e8-8873-4f33ba00673f4"
    [5] => Accept-Ranges: bytes
    [6] => Content-Length: 34931
    [7] => Connection: close
    [8] => Content-Type: image/jpeg
)

Should be easy to get the content-type after this.

More reading: the get_headers() page in the PHP manual (php.net).
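
As a rough sketch of how the content type could be picked out of that result while asking the server for `HEAD` rather than `GET`: the helper names `contentTypeFromHeaders()` and `headContentTypeViaGetHeaders()` are hypothetical. The third argument of `get_headers()` assumes PHP 7.1 or later. On older versions, `stream_context_set_default()` would be needed instead.

```php
// Hypothetical helper: pull the media type out of a get_headers() result.
// Accepts both the indexed and the associative form and ignores header-name case.
function contentTypeFromHeaders(array $headers): ?string
{
    $found = null;
    foreach ($headers as $key => $value) {
        // After redirects, the associative form holds an array of values per header.
        foreach ((array) $value as $line) {
            if (is_string($key)) {
                $line = $key . ': ' . $line;   // normalise to "Name: value"
            }
            if (preg_match('/^content-type:\s*([^;\s]+)/i', $line, $m)) {
                $found = strtolower($m[1]);    // keep the last match (final response)
            }
        }
    }
    return $found;
}

// Hypothetical wrapper: issue a HEAD request through get_headers().
function headContentTypeViaGetHeaders(string $url): ?string
{
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 10]]);
    $headers = get_headers($url, true, $context);   // context argument: PHP >= 7.1

    return $headers === false ? null : contentTypeFromHeaders($headers);
}
```

Matching the header name case-insensitively matters because the associative keys keep whatever casing the server sent.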

patrick
  • Add a flag to get an associative array. This way you can pick a particular header regardless of its position among the headers, e.g. `$headers = get_headers("", true);` – Satish Gadhave Apr 02 '14 at 14:01

Here is a solution using cURL with a CURLOPT_WRITEFUNCTION callback. Because CURLOPT_HEADER is enabled, the callback also receives the response header lines, so I check each incoming header line for the content type. If it's not what we want, the callback tells cURL to abort, so you don't waste time downloading the body of the response.

$ch = curl_init('http://stackoverflow.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true); // pass header lines to the write callback too

$data = '';
$haveHeader = false;

curl_setopt($ch, CURLOPT_WRITEFUNCTION, function($ch, $chunk) use (&$haveHeader, &$data) {
    if (!$haveHeader && ($chunk == "\n" || $chunk == "\r\n")) {
        // blank line: detected end of header
        $haveHeader = true;
    } else if (!$haveHeader) {
        // header line: look for the content type; the match stops at ';' or
        // whitespace, so neither a charset parameter nor the trailing CRLF is captured
        if (preg_match('/^content-type:\s*([^;\s]+)/i', $chunk, $matches)) {
            $contentType = strtolower($matches[1]);
            // check if content type is what we want (text/html or anything starting with image/)
            if ($contentType != 'text/html' && strpos($contentType, 'image/') !== 0) {
                // returning anything other than strlen($chunk) tells curl to abort
                return 0;
            }
        }
    } else {
        // append to data (body/content)
        $data .= $chunk;
    }

    return strlen($chunk);
});

// curl_exec() returns false if the callback aborted the transfer
if (curl_exec($ch)) {
    // use $data here
    echo strlen($data);
}
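
A possible variation on the same idea, as a sketch only: cURL also offers `CURLOPT_HEADERFUNCTION`, which receives the header lines separately from the body, so no end-of-header bookkeeping is needed. The names `makeContentTypeGuard()` and `fetchIfWanted()` are hypothetical. A response without any `Content-Type` header is still downloaded in full here, which may or may not be what you want.

```php
// Hypothetical factory: builds a CURLOPT_HEADERFUNCTION callback that aborts the
// transfer as soon as a Content-Type outside the allowed prefixes shows up.
function makeContentTypeGuard(array $allowedPrefixes): callable
{
    return function ($ch, string $headerLine) use ($allowedPrefixes): int {
        if (preg_match('/^content-type:\s*([^;\s]+)/i', $headerLine, $m)) {
            $type = strtolower($m[1]);
            foreach ($allowedPrefixes as $prefix) {
                if (strpos($type, $prefix) === 0) {
                    return strlen($headerLine);   // wanted type: keep going
                }
            }
            return 0;   // length mismatch makes cURL abort before any body is delivered
        }
        return strlen($headerLine);               // any other header line: keep going
    };
}

// Hypothetical wrapper: returns the body, or null if the transfer failed or was aborted.
function fetchIfWanted(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, makeContentTypeGuard(['text/html', 'image/']));

    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}
```

Redirects are not followed in this sketch. With `CURLOPT_FOLLOWLOCATION` enabled, the guard would also see the headers of each intermediate response.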
Jonathan Amend