I'm having a problem retrieving JSON data using multi cURL from several URLs generated from a database. If I limit the query to between 100 and 500 links, the issue does not occur, but once the number of links reaches 1000+, I start getting random NULL returns from curl_multi_getcontent.

The multi curl function:

function curlMultiExec($nodes)
{
    $node_count = count($nodes);
    $ch_arr = array();

    $master = curl_multi_init();

    for ($i = 0; $i < $node_count; $i++)
    {
        $url        = $nodes[$i]['url'];
        $ch_arr[$i] = curl_init($url);

        curl_setopt($ch_arr[$i], CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch_arr[$i], CURLOPT_BINARYTRANSFER, TRUE);
        curl_setopt($ch_arr[$i], CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch_arr[$i], CURLOPT_AUTOREFERER,    TRUE);
        curl_setopt($ch_arr[$i], CURLOPT_HEADER,         FALSE);
        curl_multi_add_handle($master, $ch_arr[$i]);
    }

    $running = null;
    do
    {
        curl_multi_exec($master, $running);
    } while ($running > 0);

    $obj = array();
    for ($i = 0; $i < $node_count; $i++)
    {
        $item = array(
            'url'     => $nodes[$i]['url'],
            'content' => curl_multi_getcontent($ch_arr[$i])
        );
        array_push($obj, $item);
    }
    curl_multi_close($master);
    return $obj;
}

Currently $nodes contains 1,912 URLs.

The output from print_r:

Array
(
    [0] => Array
        (
            [url] => http://api.worldbank.org/countries/AFG/indicators/NY.GDP.MKTP.CD?per_page=100&date=1960:2014&format=json
            [content] => [{ /* json data */ }]
        )

    [1] => Array
        (
            [url] => http://api.worldbank.org/countries/ALB/indicators/NY.GDP.MKTP.CD?per_page=100&date=1960:2014&format=json
            [content] =>   // -> here's a sample null value
        )

    .
    .  // -> some other random [content] entries are also null
    .
    .
    [1191] => Array
        (
            [url] => http://api.worldbank.org/countries/ZWE/indicators/NY.GDP.MKTP.CD?per_page=100&date=1960:2014&format=json
            [content] => [{ /* json data */ }]
        )
)

Could someone explain why it returns random null values and what causes this behavior? Is there a better approach than this?


UPDATE (2014-02-20): I found the solution in the post "curl_multi() without blocking":

The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time.

The solution is to process each request as soon as it completes. This eliminates the wasted CPU cycles from busy waiting.
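For anyone who wants to see the shape of that idea, here is a minimal sketch. It is not the code from the linked post. The function name curlMultiRolling and the $window parameter (default 50) are made up for illustration. It keeps a capped number of transfers in flight, reads each result with curl_multi_info_read as soon as that transfer finishes, and waits in curl_multi_select rather than spinning on curl_multi_exec.

```php
<?php
// Hypothetical helper (name and $window default are assumptions):
// fetch all URLs with at most $window transfers in flight at once.
function curlMultiRolling(array $urls, int $window = 50): array
{
    $urls    = array_values($urls);
    $window  = max(1, $window);
    $total   = count($urls);
    $master  = curl_multi_init();
    $results = array();
    $next    = 0;      // index of the next URL to start
    $running = 0;

    // Start the next queued URL and tag the handle with its index.
    $addNext = function () use (&$next, $urls, $master) {
        $ch = curl_init($urls[$next]);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_PRIVATE, (string) $next);
        curl_multi_add_handle($master, $ch);
        $next++;
    };

    // Fill the initial window.
    while ($next < $total && $next < $window) {
        $addNext();
    }

    do {
        curl_multi_exec($master, $running);

        // Collect every transfer that has finished so far.
        while (($info = curl_multi_info_read($master)) !== false) {
            $ch = $info['handle'];
            $i  = (int) curl_getinfo($ch, CURLINFO_PRIVATE);
            $ok = ($info['result'] === CURLE_OK);

            $results[$i] = array(
                'url'     => $urls[$i],
                'content' => $ok ? curl_multi_getcontent($ch) : null,
                'error'   => $ok ? null : curl_error($ch),
            );
            curl_multi_remove_handle($master, $ch);

            // A slot is free again, so start the next URL if any remain.
            if ($next < $total) {
                $addNext();
            }
        }

        // Wait for socket activity instead of busy-looping.
        if ($running > 0 && curl_multi_select($master, 1.0) === -1) {
            usleep(1000); // select can fail immediately on some systems
        }
    } while ($running > 0 || count($results) < $total);

    curl_multi_close($master);
    ksort($results); // restore the input order
    return $results;
}

// Usage, assuming $nodes has the same shape as in the question:
// $obj = curlMultiRolling(array_column($nodes, 'url'), 50);
```

Each entry also records an error string, so a failed transfer can be told apart from an empty body instead of both showing up as NULL.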

Based on that approach, I successfully solved this. Maybe this post can help in case someone stumbles on the same issue.

Cheers!

1mr3yn
Comments:
  • Do it in batches then.... – Lawrence Cherone Feb 19 '14 at 11:41
  • Maybe there's a limitation on the API side. If that is the problem, you could add a sleep after 500 requests or something like that. – john Smith Feb 19 '14 at 11:42
  • I read a blog that says: 'The problem is that most implementations of curl_multi wait for each set of requests to complete before processing them. If there are too many requests to process at once, they usually get broken into groups that are then processed one at a time.' Does curl_multi really behave this way? – 1mr3yn Feb 20 '14 at 02:29

1 Answer

1000 simultaneous connections will easily make you reach the common limit for maximum number of open files/sockets: 1024. It could be what you hit here.

Related question on how to change the limit: How do I change the number of open files limit in Linux?
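
To see what the limit is for the PHP process itself, something like the following may help. It is only a sketch: the helper name openFileLimits is made up for illustration, and it assumes PHP's POSIX extension is loaded (it is not available on Windows), returning null otherwise.

```php
<?php
// Hypothetical helper: report the soft and hard open-file limits
// of the current PHP process, or null if they cannot be read.
function openFileLimits(): ?array
{
    if (!function_exists('posix_getrlimit')) {
        return null; // POSIX extension not loaded
    }

    $limits = posix_getrlimit();
    if (!is_array($limits) || !isset($limits['soft openfiles'], $limits['hard openfiles'])) {
        return null;
    }

    // Values are integers, or the string "unlimited".
    return array(
        'soft' => $limits['soft openfiles'],
        'hard' => $limits['hard openfiles'],
    );
}

$limits = openFileLimits();
if ($limits !== null) {
    echo "soft: {$limits['soft']}, hard: {$limits['hard']}\n";
}
```

If the soft value is close to the number of URLs being fetched at once, the limit is a plausible cause.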

Daniel Stenberg