
I would like to create a batch script to go through 20,000 links in a DB and weed out all the 404s and such. How would I get the HTTP status code for a remote URL?

Preferably not using curl, since I don't have it installed.

4 Answers


cURL would be perfect, but since you don't have it, you'll have to get down and dirty with sockets. The technique is:

  1. Open a socket to the server.
  2. Send an HTTP HEAD request.
  3. Parse the response.

Here is a quick example:

<?php

$url = parse_url('http://www.example.com/index.html');

$host  = $url['host'];
$port  = isset($url['port']) ? $url['port'] : 80;
$path  = isset($url['path']) ? $url['path'] : '/';
if(isset($url['query']))
    $path .= '?' . $url['query'];

$request = "HEAD $path HTTP/1.1\r\n"
          ."Host: $host\r\n"
          ."Connection: close\r\n"
          ."\r\n";

$address = gethostbyname($host);
$socket = socket_create(AF_INET, SOCK_STREAM, SOL_TCP);
socket_connect($socket, $address, $port);

socket_write($socket, $request, strlen($request));

// The status line looks like "HTTP/1.1 200 OK", so the status
// code is the second space-separated token. (explode() is used
// here because split() is deprecated.)
$response = explode(' ', socket_read($socket, 1024));

print "<p>Response: ". $response[1] ."</p>\r\n";

socket_close($socket);

?>

UPDATE: I've added a few lines to parse the URL.

Adam Pierce

If I'm not mistaken, none of PHP's built-in functions return the HTTP status of a remote URL, so the best option would be to use sockets to open a connection to the server, send a request, and parse the response status:

pseudo code:

parse url => $host, $port, $path
$request = "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
$fp = fsockopen($host, $port, $errno, $errstr, $timeout), check for any errors
fwrite($fp, $request)
while (!feof($fp)) {
   $headers .= fgets($fp, 4096);
   $status = <parse $headers >
   if (<status read>)
     break;
}
fclose($fp)
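Filled in as runnable PHP, the sketch above could look like the following. The function name, the HEAD method (which avoids downloading the body), and the 5-second timeout are my own choices, not from the original answer:

```php
<?php
// Runnable version of the pseudocode sketch. Returns the HTTP status
// code as an integer, or false on failure.
function http_status($url, $timeout = 5)
{
    $parts = parse_url($url);
    $host  = $parts['host'];
    $port  = isset($parts['port']) ? $parts['port'] : 80;
    $path  = isset($parts['path']) ? $parts['path'] : '/';

    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return false; // connection failed; $errno / $errstr describe why
    }

    fwrite($fp, "HEAD $path HTTP/1.0\r\nHost: $host\r\n\r\n");

    // The first line of the response is the status line,
    // e.g. "HTTP/1.0 404 Not Found" -- the code is the second token.
    $status_line = fgets($fp, 1024);
    fclose($fp);

    $fields = explode(' ', $status_line);
    return isset($fields[1]) ? (int) $fields[1] : false;
}
```

Called in a loop over your 20,000 URLs, anything returning 404 (or false, for dead hosts) can then be weeded out.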

Another option is to use an already-built HTTP client class in PHP that can return the headers without fetching the full page content; there should be a few open-source classes available on the net...

J.C. Inacio

This page looks like it has a pretty good setup to download a page using either curl or fsockopen, and can get the HTTP headers using either method (which is what you want, really).

After using that method, you'd want to check $output['info']['http_code'] to get the data you want.
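For reference, an `http_code` entry in that kind of info array typically comes from `curl_getinfo()`. A minimal sketch of the curl variant, assuming the curl extension is available (the function name is my own):

```php
<?php
// Sketch of the curl approach, assuming the curl extension is installed.
// CURLOPT_NOBODY makes curl issue a HEAD request, so only the headers
// are transferred.
function http_status_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200, 404, ...
    curl_close($ch);
    return $code;
}
```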

Hope that helps.

Sean Schulte

You can use PEAR's HTTP::head function.
http://pear.php.net/manual/en/package.http.http.head.php
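A rough sketch of what that might look like, assuming the PEAR HTTP package is installed (`pear install HTTP`); per the linked docs, `HTTP::head()` returns the response headers as an array, with the status code under a `response_code` key (verify against your package version):

```php
<?php
require_once 'HTTP.php'; // PEAR HTTP package

$headers = HTTP::head('http://www.example.com/');
if (PEAR::isError($headers)) {
    die($headers->getMessage());
}
echo $headers['response_code'];
```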

sanxiyn