
We have about 80K product images (main images only) on our server, which are copies of our supplier's images. The supplier changes them constantly, but we have no way of knowing which ones changed (the file names always stay the same), and we need our images to stay fresh.

My idea is to take the Last-Modified header value of each image on the supplier's server and compare it to our local modification time. If our time is older, we download the new image from their server.
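
For a single image, the check I have in mind looks roughly like this ($remoteUrl and $localPath are placeholders, not variables from my real script):

// Rough sketch for one image: HEAD request, then compare timestamps.
$ch = curl_init($remoteUrl);
curl_setopt_array($ch, array(
    CURLOPT_NOBODY         => TRUE, // HEAD request, we only need headers
    CURLOPT_FILETIME       => TRUE, // ask curl to parse the Last-Modified header
    CURLOPT_RETURNTRANSFER => TRUE
));
curl_exec($ch);
$remoteTime = curl_getinfo($ch, CURLINFO_FILETIME); // -1 if no Last-Modified was sent
curl_close($ch);

if ($remoteTime > 0 && (!file_exists($localPath) || filemtime($localPath) < $remoteTime)) {
    // the remote copy is newer -> re-download it
}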

I made a PHP console script that performs curl multi requests using this library: ParallelCurl (GitHub).

My PHP code is:

function setComparatorData( $model, $filetime ) {
    global $comparator;

    // Keep only models whose remote image is newer than our local copy
    if ( file_exists(DIR_IMAGE . "catalog/" . $model . ".jpg") ) {
        $localFileTime = filemtime(DIR_IMAGE . "catalog/" . $model . ".jpg");
        if ( $localFileTime > $filetime ) return;
    }

    $comparator[$model] = $filetime;
}

function onReceived($content, $url, $ch, $request) {
    $data = curl_getinfo($ch);
    // Note: $data['filetime'] is -1 when the server sent no usable Last-Modified header
    setComparatorData($request['model'], $data['filetime']);
}

function request($limit = 100) {
    $products = array(); // Array of products loaded from the database

    $curl_options = array(
        CURLOPT_SSL_VERIFYPEER  => FALSE,
        CURLOPT_SSL_VERIFYHOST  => FALSE,
        CURLOPT_USERAGENT       => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        CURLOPT_NOBODY          => TRUE,
        CURLOPT_FOLLOWLOCATION  => TRUE,
        CURLOPT_HEADER          => TRUE,
        CURLOPT_FILETIME        => TRUE,
        CURLOPT_TIMEOUT         => 5,
        CURLOPT_FRESH_CONNECT   => TRUE // For test! (note: this forces a new connection for every request)
    );

    $parallel_curl = new ParallelCurl($limit, $curl_options);

    foreach ($products as $product) {
        $parallel_curl->startRequest("http://supplierImageUrlBasedOnProductVariable",'onReceived', array("model" => $product['model'], "source" => "remote"));
    }

    $parallel_curl->finishAllRequests();
}


$comparator = array();
request(100);
print_r($comparator);

This splits the multi-request into groups of 100 parallel requests; after one group finishes, the next one starts. My problem is that this is slow as hell: 600 requests (product images) took 8 seconds, but for 5000 it was still running after half an hour (at which point I stopped it).

I believe the biggest problem is PHP, but maybe I am wrong. Does anyone have an idea how to solve this speed issue? Should I rewrite it as a Python or bash script? Would that help? Or is there a small mistake in the code causing the slow responses?

Maybe my whole approach is wrong; if anyone has a different idea, please write it down.

Thank you

Rossko_DCA
  • Sounds like a code review question. – Andreas Feb 07 '19 at 11:36
  • Profile your code. Blackfire has a nice free tier. It could be anything really - your code could be leaking memory and causing paging, the 3rd party server might be throttling your IP after x requests etc etc – Steve Feb 07 '19 at 11:41
  • @Martijn 67 seconds – Steve Feb 07 '19 at 11:58
  • If you have no leaks, 5000 should've taken you 67 seconds (based on 8s for 600). That means you have a leak :) An easy test is to see if your SWAP is being used a lot. – Martijn Feb 07 '19 at 12:00
  • Do you really want to send 5000 different requests to their web server in a smaller timeframe than 30 minutes? They could see that as an attack on their server. 80000 is even worse. – Devon Bessemer Feb 07 '19 at 12:03
  • @Devon I am sending 60k requests to their API, perfectly fine, in approx. 300 seconds. This is not what I am asking, bro – Rossko_DCA Feb 07 '19 at 12:10
  • Well, you're going about this the wrong way is my point. You should be treating your server as a CDN and only checking when someone requests the image once a cache has expired, not constantly polling their server with thousands of requests. I'd hate to have a single API consumer sending as many requests as you. – Devon Bessemer Feb 07 '19 at 12:13
  • Well, the problem is that our customers buy goods by image and then receive a completely different product, because the supplier keeps changing the images. – Rossko_DCA Feb 07 '19 at 12:16
  • So back to my earlier comment - don't guess - profile your code. That said, if you have official API access and a good relationship, I would start by asking if they can add a query parameter to their API: `GET https://some.api.com/product-images?updated-since={datetime}` (see the sketch after these comments) – Steve Feb 07 '19 at 12:19
  • ..well then you set the cache timeout too high for your use case. Read up more on how CDNs work – Devon Bessemer Feb 07 '19 at 12:19
  • 2019-02-16: you should update the code, i made a small but significant performance optimization to the initialization loop, significantly reducing syscalls, it should run significantly faster on large lists now – hanshenrik Feb 16 '19 at 09:25
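
For illustration, Steve's updated-since suggestion would shrink the whole sync to one request per run. A hypothetical sketch, assuming the supplier exposed such an endpoint and returned a JSON array of changed models (neither the URL, the parameter, nor the response shape is a real API):

// Hypothetical: one request per run instead of thousands of HEADs.
// Endpoint and response format are assumptions based on the comment above.
$stateFile = __DIR__ . '/last_sync.txt';
$since = file_exists($stateFile) ? (int)file_get_contents($stateFile) : 0;

$json = file_get_contents('https://some.api.com/product-images?updated-since=' . urlencode(date('c', $since)));
$changedModels = json_decode($json, true); // assumed: array of models whose images changed

foreach ($changedModels as $model) {
    // re-download only the images that actually changed
}

file_put_contents($stateFile, time());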

2 Answers

modifying the code from https://stackoverflow.com/a/54353191/1067003 (which was designed to be very fast with large lists), I get:

function last_modified_from_urls(array $urls, int $max_connections, int $timeout_ms = 10000) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // drop empty urls
        }
    }
    unset($foo);
    $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $headerfunction = function ($ch, string $header) use (&$ret, &$workers) {
        $lm = 'Last-Modified:';
        if (0 === stripos($header, $lm)) {
            $save = trim(substr($header, strlen($lm)));
            $ret[$workers[(int)$ch]] = $save;
        }
        return strlen($header);
    };
    $work = function () use (&$ret, &$workers, &$mh) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break; // at least one transfer has finished
            }
            curl_multi_select($mh, 10); // wait for activity instead of busy-looping
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "2") {
                    if (!isset($ret[$workers[(int)$info['handle']]])) {
                        $ret[$workers[(int)$info['handle']]] = array(false, 0, "did not get a Last-Modified header!");
                    } else {
                        assert(
                            is_string($ret[$workers[(int)$info['handle']]]),
                            "last modified should be set by the headerfunction."
                        );
                    }
                } else {
                    // every non-2xx code is treated as an error (we don't follow redirects here, so that includes 3xx)
                    $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                }
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int)$info['handle']]));
            unset($workers[(int)$info['handle']]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // the pool is full; process finished transfers before adding more
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            $ret[$url] = array(false, -1, "curl_init() failed");
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_HEADERFUNCTION => $headerfunction,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS

    }
    while (count($workers) > 0) {
        // drain the remaining transfers
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

which should be close to as fast as you can get it with curl_multi. Usage:

$urls = array(
    'example.com',
    'example.org',
    'ratma.net'
);
var_dump(
    last_modified_from_urls(
        $urls,
        500
    )
);

returning:

array(3) {
  ["example.com"]=>
  string(29) "Fri, 09 Aug 2013 23:54:35 GMT"
  ["ratma.net"]=>
  string(29) "Thu, 09 Nov 2017 12:44:58 GMT"
  ["example.org"]=>
  string(29) "Fri, 09 Aug 2013 23:54:35 GMT"
}
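
to plug the result back into the question's comparison, strtotime() the strings and check them against the local files. a minimal sketch, assuming a $urlToModel map (url => model) that you would build from your own database:

// minimal sketch: feed the results into the comparator from the question.
// $urlToModel (url => model) is an assumed mapping, not built here.
$comparator = array();
foreach (last_modified_from_urls($urls, 500) as $url => $lastModified) {
    if (is_array($lastModified)) {
        continue; // array(false, code, message) marks a failed request
    }
    $remoteTime = strtotime($lastModified);
    $local = DIR_IMAGE . "catalog/" . $urlToModel[$url] . ".jpg";
    if (!file_exists($local) || filemtime($local) < $remoteTime) {
        $comparator[$urlToModel[$url]] = $remoteTime; // remote image is newer
    }
}
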
hanshenrik

You can store the date/time of the last execution in a file. Read that file, collect all files updated after the last execution, and update those files from the server.

$ReminderFile = __DIR__ ."/check_hour.txt";
if(!file_exists($ReminderFile)) {
    // First run: remember the current time as the last-execution time
    $handle = fopen($ReminderFile, "w");
    $lastExecuteDate = date("Y-m-d H:i:s");
    fwrite($handle, $lastExecuteDate);
} else {
    $handle = fopen($ReminderFile, "r");
    $lastExecuteDate = fread($handle, filesize($ReminderFile));
}
fclose($handle);


/**
  * @param Array  : array of file paths and names
  * @param String : date of the last execution; files modified at or after it are kept
  * @param String : optional, format of the passed date, default is 'Y-m-d H:i:s' @link http://php.net/manual/en/function.date.php for more options
  * @return Array : array of filtered file paths and names
  */
function fileFilter ($files, $date, $format = 'Y-m-d H:i:s') {
    $selectedFiles = array ();
    $threshold = DateTime::createFromFormat($format, $date)->getTimestamp();

    foreach ($files as $file) {
        // keep files modified at or after the last execution time
        if (filemtime ($file) >= $threshold) {
            $selectedFiles[] = $file;
        }
    }
    return $selectedFiles;
}
// example :
var_dump(fileFilter (glob("C:/*.*"), $lastExecuteDate));

/** Update the date in the text file after execution **/
$handle = fopen(__DIR__ ."/check_hour.txt", "w");
fwrite($handle, date('Y-m-d H:i:s'));
fclose($handle);
Bhavin Solanki
  • This is not what I need. I need to compare the "updated time" of our local server files with the remote files. I have the algorithm to compare them; I need to get information about the remote images. – Rossko_DCA Feb 07 '19 at 12:07
  • You can get the latest updated remote files from the server directory and compare them. – Bhavin Solanki Feb 08 '19 at 09:12