
I need to open around 100,000 URLs per day so that the images and HTML are cached in Cloudflare, as the content changes fairly frequently.

I suspect that curl will perform faster than a headless browser (headless Chrome via Puppeteer).

Does anyone have any experience with this, or are there better ways of doing it?

– Ryan NZ
  • I am not entirely sure, but why not run some tests locally and measure what performs better on your own hardware/network? – reZach Feb 14 '19 at 20:37
  • I am completely confident that [libcurl's curl_multi api](https://curl.haxx.se/libcurl/c/libcurl-multi.html) is significantly faster than a headless browser. – hanshenrik Feb 15 '19 at 21:35
  • Did you benchmark your headless browser against curl_multi? What numbers did you get? I have a passion for benchmarks. – hanshenrik Feb 16 '19 at 09:09

2 Answers


First off, I am confident that libcurl's curl_multi API is significantly faster than a headless browser. Even running under PHP (a much slower language than, say, C), I reckon it would be faster than a headless browser. But let's put it to the test, benchmarking it with the code from https://stackoverflow.com/a/54353191/1067003.

Benchmark this PHP script (it uses PHP's curl_multi API, which is a wrapper around libcurl's curl_multi API):

<?php
declare(strict_types=1);
$urls=array();
for($i=0;$i<100000;++$i){
    $urls[]="http://ratma.net/";
}
validate_urls($urls, 500, 1000, false, false);
// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key, containing array(bool validated, int curl_error_code, string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // silently drop empty urls
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, &$consider_http_300_redirect_as_error, &$return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            curl_multi_select($mh, 10); // wait (up to 10 seconds) for activity on any of the handles
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                }
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "3") {
                    if ($consider_http_300_redirect_as_error) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                        }
                    } else {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    }
                } elseif ($code[0] === "2") {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                    } else {
                        $ret[] = $workers[(int)$info['handle']];
                    }
                } else {
                    // all non-2xx and non-3xx codes are considered errors (500 internal server error, 400 client error, 404 not found, etc.)
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                    }
                }
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int)$info['handle']]));
            unset($workers[(int)$info['handle']]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            $work(); // drain finished transfers until a slot frees up
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // wait for all remaining transfers to finish
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

Then benchmark it against doing the same thing in a headless browser. I dare you.

For the record, ratma.net is hosted in Canada, and these results are from another datacenter, also in Canada:

foo@foo:/srv/http/default/www# time php foo.php

real    0m32.606s
user    0m19.561s
sys     0m12.991s

It completed 100,000 requests in 32.6 seconds, which works out to roughly 3,067 requests per second. I haven't actually checked, but I expect a headless browser to perform significantly worse than that.

(PS: note that this script does not download the entire content; it issues an HTTP HEAD request instead of an HTTP GET request. If you want it to download the entire content, then replace CURLOPT_NOBODY => 1 with CURLOPT_WRITEFUNCTION => function($ch, string $data){ return strlen($data); }.)
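
For example, a sketch of that change, reusing the $neww handle and $timeout_ms variable from the script above:

curl_setopt_array($neww, array(
    // download the full body (so the cache gets populated) but discard the bytes
    CURLOPT_WRITEFUNCTION => function ($ch, string $data): int {
        return strlen($data); // report every byte as handled
    },
    CURLOPT_SSL_VERIFYHOST => 0,
    CURLOPT_SSL_VERIFYPEER => 0,
    CURLOPT_TIMEOUT_MS => $timeout_ms
));

The callback must return the number of bytes it handled; returning anything less makes curl abort the transfer.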

– hanshenrik

The best way to decide is to test both, but based on my general experience with this type of automation, curl is likely to be faster.

Headless browsers are useful when you need to fully emulate a real browser—for example, when you need to ensure that JavaScript on the page runs, or when you need to examine a DOM that may be dynamically updated.

If you only care about requesting a particular resource, then there's no need for a headless browser, and a simple utility like curl or HTTPie can be easier to work with.
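
For instance, here is a minimal sketch of such a one-off request, using PHP's curl extension to match the other answer; the URL and timeout are placeholder values:

<?php
// minimal sketch: fetch one resource directly, no browser involved
$ch = curl_init("https://example.com/image.jpg"); // placeholder URL
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of echoing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    CURLOPT_TIMEOUT_MS => 10000,     // give up after 10 seconds
));
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $status, "\n";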

– Nate
  • And in my experience, [libcurl's curl_multi api](https://curl.haxx.se/libcurl/c/libcurl-multi.html) is a hell of a lot faster than the curl CLI program or wget, and much faster than I've ever seen a browser, headless or not, perform XMLHttpRequests :) If you're interested in this stuff, you should read my answer. – hanshenrik Feb 23 '19 at 12:14