
What is the correct (up-to-date) way to use curl_multi? I use the code below, but it often fails to get the content (it returns an empty result), and I don't know how to retrieve the correct response/error:

public function multi_curl($urls)
{          
    $AllResults =[]; 
    $mch = curl_multi_init();
    $handlesArray=[];
    $curl_conn_timeout= 3 *60; //max 3 minutes
    $curl_max_timeout = 30*60; //max 30 minutes

    foreach ($urls as $key=> $url) {
        $ch = curl_init();  
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_HEADER, false);
        // timeouts: https://thisinterestsme.com/php-setting-curl-timeout/   and https://stackoverflow.com/a/15982505/2377343
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_conn_timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $curl_max_timeout);
        if (defined('CURLOPT_TCP_FASTOPEN')) curl_setopt($ch, CURLOPT_TCP_FASTOPEN, 1);
        curl_setopt($ch, CURLOPT_ENCODING, ""); // empty to autodetect | gzip,deflate
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
        curl_setopt($ch, CURLOPT_URL, $url);
        $handlesArray[$key] = $ch;
        curl_multi_add_handle($mch, $handlesArray[$key]);
    }
   
    // other approaches are deprecated ! https://stackoverflow.com/questions/58971677/
    do {
        $execReturnValue = curl_multi_exec($mch, $runningHandlesAmount);
        usleep(100); // pause 100 microseconds so the loop doesn't spin at full speed
    } while ($runningHandlesAmount>0);
   
    // transfers are done; collect results and clean up
    foreach($urls as $key => $url)
    {
        $AllResults[$key]['url'] =$url;
        $handle = $handlesArray[$key];
        // Check for errors
        $curlError = curl_error($handle);
        if ($curlError!="")
        {
            $AllResults[$key]['error']    =$curlError;
            $AllResults[$key]['response'] =false;
        }
        else {
            $AllResults[$key]['error']    =false;
            $AllResults[$key]['response'] =curl_multi_getcontent($handle);
        }
        curl_multi_remove_handle($mch, $handle); curl_close($handle);
    }
    curl_multi_close($mch);
    return $AllResults;
}

and executing:

$urls = [ 'https://baconipsum.com/api/?type=meat-and-filler',
          'https://baconipsum.com/api/?type=all-meat&paras=2'];

$results = $helpers->multi_curl($urls);

Is there something that can be changed to get better results?


Update: I've found this repository, which also mentions the lack of documentation about the best use case for multi-curl and provides its own approach. However, I'm asking on SO to get other competent answers too.

T.Todua

3 Answers


I use the below code

That code has issues:

  • it has NO connection cap: if you try to open 1 million urls simultaneously, it will try to create 1 million TCP connections at once (many websites will block you as a TCP DDoS at around 100!)
  • it doesn't even verify that it was able to create the curl easy handles (which it definitely won't be able to do if it has too many urls, see the first issue)
  • it sleeps for 100 microseconds, which may be 100 microseconds longer than required; it should use select() (via curl_multi_select()) so the OS can say exactly when data has arrived/been sent, instead of waiting a fixed 100 µs (see the sketch after this list)
  • it doesn't detect transfer errors
  • (optimization nitpicking) it doesn't fetch any worker's data until every single worker has finished; an optimized implementation would drain completed workers while the remaining workers are still transferring
  • (optimization nitpicking) it doesn't re-use handles
  • (optimization nitpicking) it doesn't remove completed workers from the multi list until every single worker has finished, which uses more CPU in every curl_multi_exec call (because multi_exec has to iterate over even the finished workers that are still in the list)
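
For reference, here is a minimal sketch of the select-based wait loop the list refers to (not the full implementation below); it assumes a multi handle $mch that already has easy handles attached, like in the question's code:

// minimal sketch: drive the transfers, sleeping in curl_multi_select() instead of usleep()
do {
    $status = curl_multi_exec($mch, $stillRunning);
    if ($stillRunning > 0) {
        // block until libcurl reports socket activity, for at most 1 second
        curl_multi_select($mch, 1.0);
    }
} while ($stillRunning > 0 && $status === CURLM_OK);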

This implementation should be significantly faster: it has a configurable limit on max simultaneous connections, re-uses curl handles, removes completed workers ASAP, detects curl_multi errors, etc.


/**
 * fetch all urls in parallel,
 * warning: all urls must be unique..
 *
 * @param array $urls_unique
 *            urls to fetch
 * @param int $max_connections
 *            (optional, default 100) max simultaneous connections
 *            (some websites will auto-ban you for "ddosing" if you send too many requests simultaneously,
 *            and some wifi routers will get unstable with too many connections..)
 * @param array $additional_curlopts
 *            (optional) set additional curl options here, each curl handle will get these options
 * @throws RuntimeException on curl_multi errors
 * @throws RuntimeException on curl_init() / curl_setopt() errors
 * @return array(url=>response,url2=>response2,...)
 */
function curl_fetch_multi_2(array $urls_unique, int $max_connections = 100, ?array $additional_curlopts = null)
{
    // $urls_unique = array_unique($urls_unique);
    $ret = array();
    $mh = curl_multi_init();
    // $workers format: [spl_object_id($ch)] => url
    $workers = array();
    $max_connections = min($max_connections, count($urls_unique));
    $unemployed_workers = array();
    for ($i = 0; $i < $max_connections; ++ $i) {
        $unemployed_worker = curl_init();
        if (! $unemployed_worker) {
            throw new \RuntimeException("failed creating unemployed worker #" . $i);
        }
        $unemployed_workers[] = $unemployed_worker;
    }
    unset($i, $unemployed_worker);

    $work = function () use (&$workers, &$unemployed_workers, &$mh, &$ret): void {
        assert(count($workers) > 0, "work() called with 0 workers!!");
        $number_of_curl_handles_still_running = null;
        for (;;) {
            do {
                $err = curl_multi_exec($mh, $number_of_curl_handles_still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($err !== CURLM_OK) {
                $errinfo = [
                    "multi_exec_return" => $err,
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror($err)
                ];
                $errstr = "curl_multi_exec error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            if ($number_of_curl_handles_still_running < count($workers)) {
                // some workers have finished downloading, process them
                // echo "processing!";
                break;
            } else {
                // no workers finished yet, sleep-wait for workers to finish downloading.
                // echo "select()ing!";
                curl_multi_select($mh, 1);
                // sleep(1);
            }
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                // no idea what this is, it's not the message we're looking for though, ignore it.
                continue;
            }
            if ($info['result'] !== CURLE_OK) { // 'result' holds a CURLE_* code
                $errinfo = [
                    "effective_url" => curl_getinfo($info['handle'], CURLINFO_EFFECTIVE_URL),
                    "curl_errno" => curl_errno($info['handle']),
                    "curl_error" => curl_error($info['handle']),
                    "curl_multi_errno" => curl_multi_errno($mh),
                    "curl_multi_strerror" => curl_multi_strerror(curl_multi_errno($mh))
                ];
                $errstr = "curl_multi worker error: " . str_replace([
                    "\r",
                    "\n"
                ], "", var_export($errinfo, true));
                throw new \RuntimeException($errstr);
            }
            $ch = $info['handle'];
            // on PHP 8+ curl handles are objects, so (int)$ch no longer works as an array key
            $ch_index = spl_object_id($ch);
            $url = $workers[$ch_index];
            $ret[$url] = curl_multi_getcontent($ch);
            unset($workers[$ch_index]);
            curl_multi_remove_handle($mh, $ch);
            $unemployed_workers[] = $ch;
        }
    };
    $opts = array(
        CURLOPT_URL => '',
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_ENCODING => ''
    );
    if (! empty($additional_curlopts)) {
        // I would have used array_merge(), but it does scary stuff with integer keys; foreach() is easier to reason about
        foreach ($additional_curlopts as $key => $val) {
            $opts[$key] = $val;
        }
    }
    foreach ($urls_unique as $url) {
        while (empty($unemployed_workers)) {
            $work();
        }
        $new_worker = array_pop($unemployed_workers);
        $opts[CURLOPT_URL] = $url;
        if (! curl_setopt_array($new_worker, $opts)) {
            $errstr = "curl_setopt_array failed: " . curl_errno($new_worker) . ": " . curl_error($new_worker) . " " . var_export($opts, true);
            throw new \RuntimeException($errstr);
        }
        $workers[spl_object_id($new_worker)] = $url;
        curl_multi_add_handle($mh, $new_worker);
    }
    while (count($workers) > 0) {
        $work();
    }
    foreach ($unemployed_workers as $unemployed_worker) {
        curl_close($unemployed_worker);
    }
    curl_multi_close($mh);
    return $ret;
}
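
A hypothetical usage sketch (the URLs are the ones from the question; the extra curl option is only an illustration of $additional_curlopts):

$responses = curl_fetch_multi_2([
    'https://baconipsum.com/api/?type=meat-and-filler',
    'https://baconipsum.com/api/?type=all-meat&paras=2'
], 100, [
    CURLOPT_FOLLOWLOCATION => 1 // example of an additional per-handle option
]);
foreach ($responses as $url => $response) {
    echo $url, ' => ', strlen($response), " bytes\n";
}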
hanshenrik
  • @T.Todua yes, curl_multi_select there is used to sleep until there is more work to be done. When the OS detects that more data is available to be fetched from the socket, it will inform curl_multi_select, and curl_multi_select will wake up. That might take 100 microseconds, or 10 microseconds, or 100,000 microseconds, whatever the case may be. (It's slightly more complicated than that: curl_multi_select will also wake up if a maxed-out upload buffer gets drained during a file upload/CURLOPT_POST, or if a connection was closed by the remote host, etc.) – hanshenrik May 27 '21 at 12:33
  • In this code `still_running` is a bool but compared to `count(workers)` which is an int – William Entriken Feb 17 '23 at 02:19
  • @WilliamEntriken wrong, it's `int $number_of_curl_handles_still_running`, and if there are more workers than running handles, it means some of them have finished and should be processed :) – hanshenrik Feb 17 '23 at 10:51
  • @WilliamEntriken updated the variable name to be less confusing: `$still_running` => `$number_of_curl_handles_still_running` – hanshenrik Feb 17 '23 at 10:59

I would highly recommend looking into the Guzzle library.

It allows you to perform asynchronous curl requests in an object-oriented way.

Basic example:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;
use Psr\Http\Message\ResponseInterface;

class ExampleASyncRequester
{
    private Client $client;
    private array $responses;
    private array $urlsToRequest;
    private array $requestPromises;

    public function __construct(array $urlsToRequest)
    {
        $this->client = new Client();
        $this->responses = [];
        $this->requestPromises = [];
        $this->urlsToRequest = $urlsToRequest;
    }

    public function doRequests(): void
    {
        foreach ($this->urlsToRequest as $urlToRequest) {
            $promise = $this->client->getAsync($urlToRequest);
            // When we get a response, add it to our array
            $promise->then(
                function(ResponseInterface $response) {
                    $this->responses[] = $response;
                }
            );
            $this->requestPromises[] = $promise;
        }
        // Wait for all of the promises to either succeed or fail
        Utils::settle($this->requestPromises)->wait();
    }

    public function getResponses(): array
    {
        return $this->responses;
    }
}

$requestInstance = new ExampleASyncRequester([
    'https://www.google.com',
    'https://www.google.com',
    'https://www.google.com',
    'https://www.google.com',
    'https://www.google.com',
]);
$requestInstance->doRequests();

// Loop through our responses and dump the bodies
/** @var ResponseInterface $response */
foreach ($requestInstance->getResponses() as $response) {
    var_dump($response->getBody()->getContents());
}
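
If you also want a cap on simultaneous connections (like the curl_multi answer above has), Guzzle ships a Pool for that. A minimal sketch, reusing the two URLs from the question:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

$client = new Client();
$urls = [
    'https://baconipsum.com/api/?type=meat-and-filler',
    'https://baconipsum.com/api/?type=all-meat&paras=2',
];
$requests = function () use ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};
$pool = new Pool($client, $requests(), [
    'concurrency' => 5, // max simultaneous connections
    'fulfilled' => function (ResponseInterface $response, $index) {
        var_dump($response->getBody()->getContents());
    },
    'rejected' => function ($reason, $index) {
        // $reason is a GuzzleHttp\Exception\RequestException on transfer errors
    },
]);
$pool->promise()->wait(); // block until every request has settled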
  • Matthew, thanks for the answer. I believe this (Guzzle) would be more performant compared to multi-curl, and I will use it for my personal projects; however, in this specific case (because of the project specs), I had to stick with native curl. Otherwise, I would have awarded your useful answer. – T.Todua May 27 '21 at 17:10
  • @T.Todua actually when done correctly, curl_multi code should be faster than Guzzle (but it's *much* easier to make some performance-impacting mistake in curl_multi than doing the same in Guzzle) – hanshenrik May 27 '21 at 18:52
  • @T.Todua for example, Guzzle does not implement curl handle re-use optimizations, but my answer below does implement curl handle re-use, my curl code should use less cpu than Guzzle on large lists of urls for that reason alone. – hanshenrik May 27 '21 at 18:54
  • @hanshenrik Thanks mate, I've told other guys about your excellent answer/function too. You could even put that on GitHub as a standalone class for the public (like these guys did: https://github.com/joshfraser/rolling-curl). Btw, could you say a few words about that repo too? – T.Todua May 27 '21 at 18:59

Here is a slight change to @hanshenrik's great answer.

/**
 * Parallel fetch
 * 
 * @param array    $urls       Every url to operate on
 * @param callable $callback   Callback to call for every url (url, body, info)
 * @param int      $numWorkers Maximum number of connections to use
 * @param array    $curlOpts   Options to pass to curl
 */
function curl_fetch_multi_3($urls, $callback, $numWorkers = 10, $curlOpts = [])
{
    // Init multi handle and workers
    $multiHandle = curl_multi_init();
    $numWorkers = min($numWorkers, count($urls));
    $numEmployedWorkers = 0;
    $unemployedWorkers = [];
    for ($i = 0; $i < $numWorkers; ++ $i) {
        $unemployedWorker = curl_init();
        if ($unemployedWorker === false) {
            throw new \RuntimeException('Failed to init unemployed worker #' . $i);
        }
        if (!empty($curlOpts)) {
            curl_setopt_array($unemployedWorker, $curlOpts);
        }
        $unemployedWorkers[] = $unemployedWorker;
    }
    unset($i, $unemployedWorker);

    // Process some workers, results in some workers being moved to $unemployedWorkers
    $work = function () use (&$numEmployedWorkers, &$unemployedWorkers, &$multiHandle, $callback): void {
        assert($numEmployedWorkers > 0, 'work() called when no employed workers!!');
        for (;;) {
            $stillRunning = 0;
            do {
                $result = curl_multi_exec($multiHandle, $stillRunning);
            } while ($result === CURLM_CALL_MULTI_PERFORM);
            if ($result !== CURLM_OK) {
                throw new \RuntimeException('curl_multi_exec error: ' . curl_multi_strerror($result));
            }
            if ($stillRunning < $numEmployedWorkers) {                
                // PHP documentation for still_running is wrong, see https://curl.se/libcurl/c/curl_multi_perform.html
                // Some worker(s) finished downloading, process them
                break;
            }
            // No workers finished yet, select-wait for worker(s) to finish downloading.
            curl_multi_select($multiHandle, 1);
        }
        while (false !== ($info = curl_multi_info_read($multiHandle))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                // Per https://curl.se/libcurl/c/curl_multi_info_read.html, no other message types are now possible
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                // Per the PHP docs, 'result' holds one of the CURLE_* codes, so use curl_strerror()
                throw new \RuntimeException('curl_multi worker error: ' . curl_strerror($info['result']));
            }
            $curlHandle = $info['handle'];
            $body = curl_multi_getcontent($curlHandle);
            $curlInfo = curl_getinfo($curlHandle);
            $url = curl_getinfo($curlHandle, CURLINFO_PRIVATE);
            $callback($url, $body, $curlInfo);
            $numEmployedWorkers -= 1;
            curl_multi_remove_handle($multiHandle, $curlHandle);
            $unemployedWorkers[] = $curlHandle;
        }
    };
    
    // Main loop
    $opts = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING => '',
    ];
    foreach ($urls as $url) {
        if (empty($unemployedWorkers)) {
            $work(); // Postcondition: $unemployedWorkers is not empty
        }
        $newWorker = array_pop($unemployedWorkers);
        $opts[CURLOPT_URL] = $url;
        $opts[CURLOPT_PRIVATE] = $url;
        $result = curl_setopt_array($newWorker, $opts);
        if ($result === false) {
            throw new \RuntimeException('curl_setopt_array error: ' . curl_error($newWorker));
        }
        $numEmployedWorkers += 1;
        $result = curl_multi_add_handle($multiHandle, $newWorker);
        if ($result !== CURLM_OK) {
            // curl_multi_add_handle() returns a CURLM_* code (0 on success), never false
            throw new \RuntimeException('curl_multi_add_handle error: ' . curl_multi_strerror($result));
        }
    }
    while ($numEmployedWorkers > 0) {
        $work();
    }
    foreach ($unemployedWorkers as $unemployedWorker) {
        curl_close($unemployedWorker);
    }
    curl_multi_close($multiHandle);
}
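
A hypothetical usage sketch, again with the question's URLs; the callback just collects each body keyed by URL:

$results = [];
curl_fetch_multi_3(
    [
        'https://baconipsum.com/api/?type=meat-and-filler',
        'https://baconipsum.com/api/?type=all-meat&paras=2',
    ],
    function ($url, $body, $info) use (&$results) {
        // $info is the curl_getinfo() array for the finished handle
        $results[$url] = ['status' => $info['http_code'], 'body' => $body];
    },
    10
);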

Main improvements are:

  1. Use CURLOPT_PRIVATE, which obviates using an array to track URLs
  2. Link to relevant documentation/gotchas
William Entriken