
I am building a simple app that reads JSON data from 15 different URLs. I have a special requirement: I need to do this server-side. I am using file_get_contents($url).

Since I am using file_get_contents($url), I wrote a simple script; here it is:

$websites = array(
    $url1,
    $url2,
    $url3,
    ...
    $url15
);

foreach ($websites as $website) {
    $data[] = file_get_contents($website);
}

It turned out to be very slow, because PHP waits for each request to finish before starting the next one.

user1205408

  • Google gives many results for "curl parallel requests" – matino Feb 16 '12 at 09:44
  • PHP is a single-threaded language, it doesn't have any kind of internal support for concurrency. You could write a script that fetches a single URL (supplied as an argument) and execute 15 instances of it (see the sketch after these comments). – GordonM Feb 16 '12 at 10:13
  • Thank you for all of your opinions. :) – user1205408 Feb 16 '12 at 14:23
  • In case anyone stumbles upon this page, GordonM's comment above is incorrect; the PHP curl library specifically supports multiple parallel requests. Apart from that, you can create fully multi-threaded PHP applications using the pthreads extension, though that is entirely unnecessary and overkill for this because the curl extension supports it simply. – thomasrutter Aug 04 '15 at 13:08
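
As an illustration of GordonM's suggestion above, here is a minimal sketch (not from the original thread; the worker script fetch_one.php is hypothetical and would contain little more than echo file_get_contents($argv[1]);):

$websites = array($url1, $url2, /* ... */ $url15);

// launch one PHP worker per URL; popen() returns immediately,
// so all workers download in parallel
$procs = array();
foreach ($websites as $i => $url) {
    $procs[$i] = popen('php fetch_one.php ' . escapeshellarg($url), 'r');
}

// collect the output; each read only blocks until that particular worker is done
$data = array();
foreach ($procs as $i => $proc) {
    $data[$i] = stream_get_contents($proc);
    pclose($proc);
}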

3 Answers


If you mean multi-curl, then something like this might help:


$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// create one easy handle per URL and attach it to the multi handle
for($i = 0; $i < $node_count; $i++)
{
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// drive all transfers until every handle has finished
do {
    curl_multi_exec($master, $running);
} while($running > 0);

$results = array();
for($i = 0; $i < $node_count; $i++)
{
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);

Sudhir Bastakoti
  • may i know what $running contains? – ramya br Nov 06 '15 at 06:25
  • @ramyabr boolean (reference) if multicurl is still running and getting data. – Shlizer Feb 18 '16 at 08:57
  • your multi_exec loop *will work*, but it will also waste a shitton of cpu, using 100% CPU (of 1 core) until everything has been downloaded, because your loop is spamming curl_multi_exec(), an *async* function, as fast as possible, until everything is downloaded. if you change it to ```do {curl_multi_exec($master,$running);if($running>0){curl_multi_select($master,1);}} while($running > 0);``` then it will use ~1% cpu instead of 100% cpu (a better loop can still be constructed though; this would be even better: ```for(;;){curl_multi_exec($master,$running);if($running<1)break;curl_multi_select($master,1);}```) – hanshenrik Aug 18 '20 at 10:25
  • @DivyeshPrajapati it works great until you check how much CPU it's consuming, see my comment above ^^ – hanshenrik Aug 18 '20 at 10:59
  • @Shlizer that's incorrect, $running contains an int, the number of curl handles who still hasn't finished downloading the entire response (it's safe to use the variable as if it was a bool, though, because int(0)==false and int(>=1)==true , but the variable itself is int, not bool, and it can contain any number >= 0, like int(5) ) – hanshenrik Aug 18 '20 at 11:01
  • @hanshenrik I didn't check that, but it definitely reduces request time... I was making 10 requests simultaneously, each taking 3 seconds, so in total it was taking around 25-30 seconds; after using this, the time dropped to 5-8 seconds – Divyesh Prajapati Aug 23 '20 at 15:22

I don't particularly like the approach of any of the existing answers:

Timo's code: it might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong; it might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.

Sudhir's code: it will not sleep when $still_running > 0, and spam-calls the async function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% CPU (of 1 core) until everything has been downloaded; in other words, it fails to sleep while downloading.

Here's an approach with neither of those issues:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);

Note that all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) share one issue: they open all connections simultaneously. If you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections simultaneously. If you only need to download, say, 50 websites at a time, maybe try:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites,50));
function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            $cms = curl_multi_select($mh, 1);
            // var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            // echo "NOT FALSE!";
            // var_dump($info);
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $info['result'],
                        "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                    ), true);
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int) $info['handle']]] = print_r(array(
                        false,
                        $err,
                        "curl error " . $err . ": " . curl_strerror($err)
                    ), true);
                }
            } else {
                $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int) $info['handle']]));
            unset($workers[(int) $info['handle']]);
            curl_close($info['handle']);
        }
        // echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    -1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[(int) $neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

That will download the entire list, never fetching more than 50 URLs simultaneously. (Even this approach stores all the results in RAM, though, so it may still run out of memory; if you want to store the results in a database instead of RAM, the curl_multi_getcontent part can be modified to write to a database instead of to a RAM-persistent variable.)
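
As a minimal sketch of that database variant (not part of the original answer; the SQLite file name and table schema are assumptions, and $stmt would also have to be added to the closure's use() list):

$db = new PDO('sqlite:results.db');
$db->exec('CREATE TABLE IF NOT EXISTS fetched (url TEXT PRIMARY KEY, body BLOB)');
$stmt = $db->prepare('INSERT OR REPLACE INTO fetched (url, body) VALUES (?, ?)');

// ... then, inside $work(), replace
//     $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
// with a write straight to the database:
$stmt->execute(array($workers[(int) $info['handle']], curl_multi_getcontent($info['handle'])));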

hanshenrik
  • Could you please tell me what `$return_fault_reason` amounts to? – Ali Niaz Feb 02 '21 at 09:07
  • @AliNiaz sorry, forgot about that when copying the code from [this answer](https://stackoverflow.com/a/54717579/1067003); `$return_fault_reason` is supposed to be an argument telling whether a failed download should just be ignored, or whether it should come with an error message; I updated the code with the `$return_fault_reason` argument now. – hanshenrik Feb 02 '21 at 17:59
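
In other words (a usage sketch with made-up URLs): with `$return_fault_reason = true` a failed URL shows up in the result array as a print_r'd (false, errno, message) tuple, while with `false` failed URLs are simply left out:

$urls = array("http://example.com", "http://0.0.0.0/unreachable");

// failures included as error tuples, keyed by URL:
var_dump(fetch_urls($urls, 50, 10000, true));

// failures silently dropped; only successful bodies remain:
var_dump(fetch_urls($urls, 50, 10000, false));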

I would like to provide a more complete example that does not hit the CPU at 100% and does not crash when there's a slight error or something unexpected.

It also shows how to fetch the headers, the body, and the request info, and how to follow redirects manually.

Disclaimer: this code is intended to be extended and implemented into a library, or used as a quick starting point, and as such the functions inside it are kept to a minimum.

function mtime(){
    return microtime(true);
}
function ptime($prev){
    $t = microtime(true) - $prev;
    $t = $t * 1000;
    return str_pad($t, 20, 0, STR_PAD_RIGHT);
}

// This function exists to add compatibility for CURLM_CALL_MULTI_PERFORM for old curl versions, on modern curl it will only run once and be the equivalent of calling curl_multi_exec
function curl_multi_exec_full($mh, &$still_running) {
    // In theory curl_multi_exec should never return CURLM_CALL_MULTI_PERFORM (-1) because it has been deprecated
    // In practice it sometimes does
    // So imagine that this just runs curl_multi_exec once and returns its value
    do {
        $state = curl_multi_exec($mh, $still_running);

        // curl_multi_select($mh, $timeout) simply blocks for $timeout seconds while curl_multi_exec() returns CURLM_CALL_MULTI_PERFORM
        // We add it to prevent CPU 100% usage in case this thing misbehaves (especially for old curl on windows)
    } while ($still_running > 0 && $state === CURLM_CALL_MULTI_PERFORM && curl_multi_select($mh, 0.1));
    return $state;
}

// This function replaces curl_multi_select and makes the name make more sense, since all we're doing is waiting for curl.
// It also enforces a minimum sleep time between requests to avoid excessive CPU usage.
function curl_multi_wait($mh, $minTime = 0.001, $maxTime = 1){
    $umin = $minTime*1000000;

    $start_time = microtime(true);

    // it sleeps until there is some activity on any of the descriptors (curl files)
    // it returns the number of descriptors (curl files that can have activity)
    $num_descriptors = curl_multi_select($mh, $maxTime);

    // if the system returns -1, it means that the wait time is unknown, and we have to decide the minimum time to wait
    // but our `$timespan` check below catches this edge case, so this `if` isn't really necessary
    if($num_descriptors === -1){
        usleep((int)$umin);
    }

    // measure the elapsed time in microseconds, so it is comparable to $umin
    $timespan = (microtime(true) - $start_time) * 1000000;

    // This thing runs very fast, up to 1000 times for 2 urls, which wastes a lot of CPU
    // This will reduce the runs so that each interval is separated by at least minTime
    if($timespan < $umin){
        usleep((int)($umin - $timespan));
        //print "sleep for ".($umin - $timespan).PHP_EOL;
    }
}


$handles = [
    [
        CURLOPT_URL=>"http://example.com/",
        CURLOPT_HEADER=>false,
        CURLOPT_RETURNTRANSFER=>true,
        CURLOPT_FOLLOWLOCATION=>false,
    ],
    [
        CURLOPT_URL=>"http://www.php.net",
        CURLOPT_HEADER=>false,
        CURLOPT_RETURNTRANSFER=>true,
        CURLOPT_FOLLOWLOCATION=>false,

        // this function is called by curl for each header received
        // This complies with RFC822 and RFC2616, please do not suggest edits to make use of the mb_ string functions, it is incorrect!
        // https://stackoverflow.com/a/41135574
        CURLOPT_HEADERFUNCTION=>function($ch, $header)
        {
            print "header from http://www.php.net: ".$header;
            //$header = explode(':', $header, 2);
            //if (count($header) < 2){ // ignore invalid headers
            //    return $len;
            //}

            //$headers[strtolower(trim($header[0]))][] = trim($header[1]);

            return strlen($header);
        }
    ]
];




//create the multiple cURL handle
$mh = curl_multi_init();

$chandles = [];
foreach($handles as $opts) {
    // create cURL resources
    $ch = curl_init();

    // set URL and other appropriate options
    curl_setopt_array($ch, $opts);

    // add the handle
    curl_multi_add_handle($mh, $ch);

    $chandles[] = $ch;
}


//execute the multi handle
$prevRunning = null;
$count = 0;
do {
    $time = mtime();

    // $running contains the number of currently running requests
    $status = curl_multi_exec_full($mh, $running);
    $count++;

    print ptime($time).": curl_multi_exec status=$status running $running".PHP_EOL;

    // One less is running, meaning one has finished
    if($running < $prevRunning){
        print ptime($time).": curl_multi_info_read".PHP_EOL;

        // msg: The CURLMSG_DONE constant. Other return values are currently not available.
        // result: One of the CURLE_* constants. If everything is OK, the CURLE_OK will be the result.
        // handle: Resource of type curl indicates the handle which it concerns.
        while ($read = curl_multi_info_read($mh, $msgs_in_queue)) {

            $info = curl_getinfo($read['handle']);

            if($read['result'] !== CURLE_OK){
                // handle the error somehow
                print "Error: ".$info['url'].PHP_EOL;
            }

            if($read['result'] === CURLE_OK){
                /*
                // This will automatically follow the redirect and still give you control over the previous page
                // TODO: max redirect checks and redirect timeouts
                if(isset($info['redirect_url']) && trim($info['redirect_url'])!==''){

                    print "running redirect: ".$info['redirect_url'].PHP_EOL;
                    $ch3 = curl_init();
                    curl_setopt($ch3, CURLOPT_URL, $info['redirect_url']);
                    curl_setopt($ch3, CURLOPT_HEADER, 0);
                    curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1);
                    curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, 0);
                    curl_multi_add_handle($mh,$ch3);
                }
                */

                print_r($info);
                $body = curl_multi_getcontent($read['handle']);
                print $body;
            }
        }
    }

    // Still running? keep waiting...
    if ($running > 0) {
        curl_multi_wait($mh);
    }

    $prevRunning = $running;

} while ($running > 0 && $status == CURLM_OK);

//close the handles
foreach($chandles as $ch){
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

print $count.PHP_EOL;
Timo Huovinen
  • your multi_exec() loop makes no sense and will always exit on the first iteration... if you absolutely insist on supporting CURLM_CALL_MULTI_PERFORM (which has been deprecated in curl since at least 2012 and is not used anymore), then the loop should be like: ```for (;;) { do { $ex = curl_multi_exec($mh, $still_running); } while ($ex === CURLM_CALL_MULTI_PERFORM); if ($ex !== CURLM_OK) { /*handle curl error?*/ } if ($still_running < 1) { break; } curl_multi_select($mh, 1); }``` – hanshenrik Aug 18 '20 at 10:15
  • your code is handling `CURLM_CALL_MULTI_PERFORM` (hence CCMP) wrong, you're not supposed to run select() if you get CCMP, you're supposed to call multi_exec() again if you get CCMP, but worse, as of (2012ish?) curl never returns CCMP anymore, so your `$state === CCMP ` check will _always_ fail, meaning your exec loop will *always* exit after the first iteration – hanshenrik Aug 18 '20 at 10:43
  • My original reasoning was to add it as backwards compatibility for older versions of curl (pre 2012) and it's ok if it just exits the loop immediately. That's also why I packaged it into `curl_multi_exec_full`, which can be renamed to `curl_multi_exec` for post-2012 compatibility. CCMP will select and exec again. I really do appreciate your comment and would like some more reasoning why the code is wrong, right now I'm not seeing the error. – Timo Huovinen Aug 19 '20 at 11:28
  • for one: you run select() if you get CCMP, that's wrong. you're not supposed to wait for more data to arrive if you get CCMP. it means you're immediately supposed to run curl_multi_exec() if you get CCMP (it allows for programs that needs very low latency/realtime-systems to do other stuff if a single multi_exec() used too much cpu/time, but so many people didn't understand how to use it correctly that the curl devs decided to deprecate it: too many got it wrong, and very few people actually needed it. on the curl mailing list there was only 1 person that complained and actually used it) – hanshenrik Aug 19 '20 at 14:35
  • two: you never run select() if you don't get CCMP, but that's also wrong, sometimes (in these days, *OFTEN*) you're supposed to run select() even if you don't get CCMP, but your code doesn't. – hanshenrik Aug 19 '20 at 14:35
  • here is how i think the function should look like: https://3v4l.org/1iaqm – hanshenrik Aug 19 '20 at 14:39
  • @hanshenrik When I read the documentation (I don't remember where it is) it said that select didn't do anything besides adding wait time while CCMP, which was actually required for Windows, otherwise it would hit the 100% cpu mark on old Curls, so if I remove the select I would be breaking it for pre 2012 curl on windows. I do run select, it's inside the `curl_multi_wait` function, notice that it counts process completion one process at a time lower down the code, meaning that we don't care that `curl_multi_exec_full` just finished in one loop or runs select, which it won't on new curl – Timo Huovinen Aug 19 '20 at 15:40
  • @hanshenrik `do { $state = curl_multi_exec($mh, $still_running); } while ($state === CURLM_CALL_MULTI_PERFORM);` hits 100% cpu and I'm pretty sure is a bug (especially on windows), the select is basically a timeout to prevent 100% cpu. Remember that `do` will hit 100% cpu unless there's a sleep in there. – Timo Huovinen Aug 19 '20 at 15:41
  • @hanshenrik also notice that I'm capturing the response as it completes (unlike your example), not when all of them have been completed, allowing me to do manual redirects with minimal time loss. I can even inject additional requests after multi has been started. – Timo Huovinen Aug 19 '20 at 15:51
  • that loop hits 100% cpu when there's more data downloaded and ready to be fetched, as it should do. funny thing about this script: https://3v4l.org/eaHCl if you run it on MS Windows's cmd.exe (on a fast 50mbit connection, at least) it will actually use 100% cpu, from CMD.exe - cmd is very slow at receiving null bytes, and it's being bombarded with 50mbit worth of null bytes every second. but if you run it on a cygwin terminal, or a linux terminal, or if you run it as `php foo.php > NUL` (NUL is windows's /dev/null, but in Windows it's in every folder), it uses ~1% CPU, try it yourself :P – hanshenrik Aug 19 '20 at 15:57
  • @hanshenrik Interesting, I did not know that. To clarify, I added the `curl_multi_select` into CCMP to cope with an old windows bug, where the select acts as a kind of sleep. I'm a bit worried that removing it will make it less "robust", but I'm ok with that. – Timo Huovinen Aug 19 '20 at 15:59
  • @hanshenrik what's the harm in keeping `curl_multi_select` for CCMP in `curl_multi_exec_full`? – Timo Huovinen Aug 19 '20 at 16:02
  • CCMP means "more data is ready to be read now, you should run read() now, it will not block" - and then your code proceed to.. run select() (instead of read()) and wait for even more data to arrive, instead of read()ing - if the next data comes slowly, or if some buffers are full and waiting to be read, i'm assuming it can slow down the code (waiting on select() when you should be read()'ing ) – hanshenrik Aug 19 '20 at 19:30
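
For readability, the exec loop hanshenrik recommends in these comments, reformatted from the inline snippet above ($mh being the multi handle):

for (;;) {
    do {
        $ex = curl_multi_exec($mh, $still_running);
    } while ($ex === CURLM_CALL_MULTI_PERFORM);
    if ($ex !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all transfers have finished
        break;
    }
    // sleep until there is activity on at least one handle, then exec again
    curl_multi_select($mh, 1);
}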