
I'm trying to take a list of 20,000+ domain names and check whether they are "alive". All I really need is a simple HTTP status code check, but I can't figure out how to get that working with curl_multi. In a separate script I have the following function, which simultaneously checks a batch of 1000 domains and returns the JSON response for each. Maybe this can be modified to just get the HTTP response code instead of the page content?


$dotNetRequests = array(/* ... the list of domains ... */);
$NetcurlRequest = array();

// loop through the domains in batches of 1000
foreach (array_chunk($dotNetRequests, 1000) as $Netrequests) {
    $results = checkDomains($Netrequests);
    $NetcurlRequest = array_merge($NetcurlRequest, $results);
}

function checkDomains($data) {

    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();

    // multi handle
    $mh = curl_multi_init();

    // loop through $data and create curl handles,
    // then add them to the multi handle
    foreach ($data as $id => $d) {

        $curly[$id] = curl_init();

        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL,            $url);
        curl_setopt($curly[$id], CURLOPT_HEADER,         0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);

        // post?
        if (is_array($d) && !empty($d['post'])) {
            curl_setopt($curly[$id], CURLOPT_POST,       1);
            curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
        }

        curl_multi_add_handle($mh, $curly[$id]);
    }

    // execute the handles
    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    // get content and remove handles
    foreach ($curly as $id => $c) {
        $result[$id] = curl_multi_getcontent($c);
        if ($result[$id]) {
            // domain responded: flag its .com twin in the database
            $netName = $data[$id];
            $dName   = str_replace(".net", ".com", $netName);
            $query   = "UPDATE table1 SET dotnet = '1' WHERE Domain = '$dName'";
            mysql_query($query);
        }
        curl_multi_remove_handle($mh, $c);
    }

    // all done
    curl_multi_close($mh);

    return $result;
}
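
Since only liveness is needed, the same loop can be trimmed to fetch nothing but the status code. The sketch below is illustrative, not from the original post: CURLOPT_NOBODY makes curl issue HEAD requests so no body is transferred, and curl_getinfo() reads each handle's code once the multi loop drains.

// Hedged sketch of a status-code-only variant (assumed names).
function checkDomainCodes($domains) {
    $curly = array();
    $codes = array();
    $mh    = curl_multi_init();

    foreach ($domains as $id => $domain) {
        $curly[$id] = curl_init();
        curl_setopt($curly[$id], CURLOPT_URL,            $domain);
        curl_setopt($curly[$id], CURLOPT_NOBODY,         1); // HEAD request, no body
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curly[$id], CURLOPT_TIMEOUT,        10);
        curl_multi_add_handle($mh, $curly[$id]);
    }

    $running = null;
    do {
        curl_multi_exec($mh, $running);
    } while ($running > 0);

    foreach ($curly as $id => $c) {
        // 0 means the connection itself failed (domain likely dead)
        $codes[$id] = curl_getinfo($c, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $c);
    }

    curl_multi_close($mh);
    return $codes;
}

A handful of servers mishandle HEAD, so dropping CURLOPT_NOBODY trades a little bandwidth for compatibility.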
user1647347

2 Answers


In any other language you would thread this kind of operation ...

https://github.com/krakjoe/pthreads

And you can in PHP too :)

I would suggest a few workers rather than 20,000 individual threads. Not that 20,000 threads is out of the realms of possibility - it isn't - but it wouldn't be a good use of resources. I would do as you are now and have 20 workers getting the results of 1,000 domains each.

I assume you don't need me to give an example of getting a response code; curl will give it to you, but it's probably overkill to use curl given that you don't require its parallel transfer capabilities once threads are doing the work: I would fsockopen port 80, fprintf a `GET / HTTP/1.0` request followed by a blank line, fgets the first line, and close the connection. If you're going to be doing this all the time then I would also send `Connection: close` so that the receiving machines are not holding connections unnecessarily.
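A minimal sketch of that socket check (the function name, the 5-second timeout, and the Host header are illustrative assumptions, not part of the answer):

// Hedged sketch of the socket check described above.
function check_alive($domain, $timeout = 5)
{
    $fp = @fsockopen($domain, 80, $errno, $errstr, $timeout);
    if (!$fp) {
        return false; // DNS failure or connection refused: treat as dead
    }

    // HTTP/1.0 plus Connection: close so the server drops the socket for us
    fwrite($fp, "GET / HTTP/1.0\r\nHost: $domain\r\nConnection: close\r\n\r\n");

    // first line looks like "HTTP/1.0 200 OK"
    $status = fgets($fp, 128);
    fclose($fp);

    if ($status !== false && preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $status, $m)) {
        return (int) $m[1]; // the HTTP response code
    }
    return false;
}

Any numeric return means the host answered; false means it is dead or not speaking HTTP.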

Joe Watkins
  • Sounds like a good plan. I have a cURL function I'm using to check the HTTP response code, but the threading was the issue. I'll try your link and see how far I can get :) – user1647347 Sep 20 '12 at 16:33
  • I thought I'd save you some time: the pages you need to read from the wiki to get going are 1, 2 and 7 (always start at the beginning) ... the rest you can read while your !! PHP threads !! are running ... – Joe Watkins Sep 20 '12 at 16:37
  • Awesome, thanks! It all seems to add up, but I'm not sure where the worker threads come into play? Do you think I need them for this particular situation? The end goal is really to take 20k domains and check whether they are up or down in as little time as possible. – user1647347 Sep 20 '12 at 19:35
  • Like I said, creating 20,000 threads is possible, but it's not a very good use of resources. The speed depends on those resources: will your network hardware keep up, will your server keep up? I think in most cases, unless you have Google-like computing power, it will be more efficient (including faster) to execute workers in groups and allow the machine to keep doing its job at the same time ... – Joe Watkins Sep 20 '12 at 21:40
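
A worker arrangement along the lines Joe describes might look like this under the pthreads API of that era (Worker/Stackable). The class names, the batch split, and the reuse of the illustrative check_alive() from the earlier sketch are all assumptions:

// Hedged sketch only: assumes the 2012-era pthreads extension
// (PHP built with ZTS) and the hypothetical check_alive() above.
class DomainJob extends Stackable {
    public $domain;
    public $alive;

    public function __construct($domain) {
        $this->domain = $domain;
    }

    public function run() {
        $this->alive = (bool) check_alive($this->domain);
    }
}

class DomainWorker extends Worker {
    public function run() {} // nothing to set up per worker
}

$domains = array(/* ... 20,000 domains ... */);
$jobs    = array();
$workers = array();

// 20 workers, each stacked with its share of 1,000 domains
foreach (array_chunk($domains, 1000) as $i => $chunk) {
    $workers[$i] = new DomainWorker();
    foreach ($chunk as $domain) {
        $job = new DomainJob($domain);
        $workers[$i]->stack($job);
        $jobs[] = $job;
    }
    $workers[$i]->start();
}

// wait for every worker to drain its stack
foreach ($workers as $worker) {
    $worker->shutdown();
}

foreach ($jobs as $job) {
    printf("%s is %s\n", $job->domain, $job->alive ? 'up' : 'down');
}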

This script works great for handling bulk simultaneous cURL requests using PHP. I'm able to parse through 50k domains in just a few minutes using it!

https://github.com/petewarden/ParallelCurl/
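
Basic usage follows the pattern in the project's README; the callback name, the domain list, and the window size of 10 below are illustrative choices:

require_once('parallelcurl.php'); // from the repository above

// called as each transfer completes; signature per the README
function on_request_done($content, $url, $ch, $user_data) {
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    echo "$url -> $httpcode\n";
}

$domains = array('example.com', 'example.net' /* ... */);

$curl_options = array(
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_NOBODY         => true, // status only, skip the body
);

// at most 10 requests in flight at once (tune to taste)
$parallel_curl = new ParallelCurl(10, $curl_options);

foreach ($domains as $domain) {
    $parallel_curl->startRequest('http://' . $domain, 'on_request_done', null);
}

// block until every outstanding request has finished
$parallel_curl->finishAllRequests();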

user1647347