
Possible Duplicate:
cURL Mult Simultaneous Requests (domain check)

I'm trying to check whether a website exists (if it responds, that's good enough). The issue is that my array of domains contains 20,000 entries, and I'm trying to speed up the process as much as possible.

I've done some research and come across this page, which details simultaneous cURL requests -> http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/

I also found this page, which seems to be a good way of checking if a domain's webpage is up -> http://www.wrichards.com/blog/2009/05/php-check-if-a-url-exists-with-curl/

Any ideas on how to quickly check 20,000 domains to see if they are up?

user1647347

4 Answers

    $http = curl_init($url);
    curl_setopt($http, CURLOPT_NOBODY, true);         // a HEAD request is enough
    curl_setopt($http, CURLOPT_RETURNTRANSFER, true); // don't echo the response
    $result = curl_exec($http);
    $http_status = curl_getinfo($http, CURLINFO_HTTP_CODE);
    curl_close($http);
    if ($http_status == 200) {
        // site is up
    }
Ozerich
  • That's pretty similar to what I've listed above as a reference. It doesn't take into account that my array has 20k URLs. – user1647347 Sep 22 '12 at 20:49
  • You can use multi_curl requests to speed up this operation – Ozerich Sep 22 '12 at 20:51
  • Can you provide an example please? I don't think it would be a good idea to multi_curl 20,000 at once. Maybe you can chunk them? – user1647347 Sep 22 '12 at 20:52
  • You don't have to do it in the request script. You can run the checks in a background script and save the results to a database; the request script then just selects the data from the database. That way the request itself doesn't need to be fast. – Ozerich Sep 22 '12 at 20:56
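
Following up on the comment thread, here is a minimal sketch of chunking the 20k list and running each batch through curl_multi. The function name `check_urls_in_batches` and the batch size are illustrative, not from any library:

```php
<?php
// Sketch: check a large list of URLs in batches with curl_multi.
// Names and sizes here are illustrative assumptions.
function check_urls_in_batches(array $urls, int $batch_size = 100): array
{
    $status = [];
    foreach (array_chunk($urls, $batch_size) as $batch) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD: no body needed
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo output
            curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // don't hang on dead hosts
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        // Run the whole batch until every transfer is finished.
        do {
            curl_multi_exec($mh, $running);
            curl_multi_select($mh);
        } while ($running > 0);

        foreach ($handles as $url => $ch) {
            $status[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $status;
}
```

A batch size around 100 keeps you from opening 20,000 sockets at once while still being far faster than checking domains one by one.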

Check out RollingCurl.

It allows you to execute multiple curl requests in parallel. Here is an example:

    require 'curl/RollingCurl.php';
    require 'curl/RollingCurlGroup.php';

    $rc = new RollingCurl('handle_response');
    $rc->window_size = 2; // number of concurrent connections

    foreach ($domain_array as $domain => $value)
    {
        $request = new RollingCurlRequest($value);
        $rc->add($request);
    }

    $rc->execute();

    function handle_response($response, $info)
    {
        if ($info['http_code'] === 200)
        {
            // site exists; handle response data
        }
    }
Ryan
  • this looks promising...I'll try it right now – user1647347 Sep 22 '12 at 20:58
  • Got this working, and it seems fast. The problem is $rc->window_size: in his example it's set to 20, but that makes the script only process the first 20 domains. It seems it won't actually process batches – user1647347 Sep 22 '12 at 21:23
  • No, the window size is the number of concurrent connections that will be executed at once. – Ryan Sep 23 '12 at 04:21

I think that if you really want to speed up the process and save a lot of bandwidth (as I understand, you plan to check availability on a regular basis), then you should work with sockets, not with cURL. You can open several sockets at a time and arrange 'asynchronous' treatment of each socket. Then, instead of sending a "GET $sitename/ HTTP/1.0\r\n\r\n" request, send "HEAD $sitename/ HTTP/1.0\r\n\r\n". It returns the same status code as the GET request would, but without a response body. You only need to parse the first line of the response to get the answer, so you can simply regex-match it against the good response codes. As one extra optimization, your code will eventually learn which sites sit on the same IPs, so you can cache the name mappings and order the list by IP. Then you can check several of those sites over one connected socket (remember to add a 'Connection: keep-alive' header).
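A simplified sketch of the HEAD-over-sockets idea, checking one host per connection; `head_status` is a made-up helper name, and the fully asynchronous version described above would additionally set stream_set_blocking($fp, false) and poll several sockets with stream_select():

```php
<?php
// Sketch: send a raw HEAD request over a socket and parse only the
// status line of the response. Helper name and timeout are assumptions.
function head_status(string $host, int $timeout = 5): ?int
{
    $fp = @stream_socket_client("tcp://$host:80", $errno, $errstr, $timeout);
    if (!$fp) {
        return null; // connection failed: treat as down
    }
    fwrite($fp, "HEAD / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp, 512); // only the first line is needed
    fclose($fp);
    // e.g. "HTTP/1.1 200 OK" -> 200
    if ($statusLine && preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}
```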

Serge

You can use multi curl requests, but you probably want to limit them to about 10 at a time. You would have to track jobs in a separate database for processing the queue: Threads in PHP
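
A minimal sketch of the track-jobs-in-a-database idea, using SQLite here only for brevity; the table and column names are made up:

```php
<?php
// Sketch: a tiny domain-check queue in a database. Table/column names
// are illustrative; any RDBMS would work in place of SQLite.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE domain_queue (
    domain TEXT PRIMARY KEY,
    status TEXT DEFAULT 'pending',  -- pending / up / down
    http_code INTEGER
)");

// Enqueue the domains once.
$ins = $db->prepare('INSERT INTO domain_queue (domain) VALUES (?)');
foreach (['example.com', 'example.org'] as $d) {
    $ins->execute([$d]);
}

// A background worker grabs a small batch at a time...
$batch = $db->query(
    "SELECT domain FROM domain_queue WHERE status = 'pending' LIMIT 10"
)->fetchAll(PDO::FETCH_COLUMN);

// ...checks each one (the curl/multi_curl code goes here), then records the result.
$upd = $db->prepare('UPDATE domain_queue SET status = ?, http_code = ? WHERE domain = ?');
foreach ($batch as $domain) {
    $code = 200; // placeholder: the real value would come from the curl check
    $upd->execute([$code === 200 ? 'up' : 'down', $code, $domain]);
}
```

The request-facing script then only ever reads from `domain_queue`, so its speed no longer depends on the checks themselves.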

chovy