We have about 80K product images (main images only) on our server, which are copies of the supplier's images. The supplier changes them constantly, but we don't know which ones have changed (the file names stay the same), and we need our images to stay fresh.
My idea is to take the Last-Modified header value of each image on the supplier's server and compare it to our local file's modification time. If our time is earlier, we download the new image from the server.
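As a minimal, network-free sketch of that comparison (the function name and paths here are hypothetical, just to illustrate the decision): given the Last-Modified timestamp curl reports for the remote image (or -1 when the server sends no such header) and the path of our local copy, decide whether a re-download is needed.

```php
<?php
// Hypothetical helper: decide whether a remote image should be re-downloaded.
// $remoteTime is the Unix timestamp curl derives from the Last-Modified
// header (CURLINFO_FILETIME), which is -1 when the header is missing.
function needsRefresh($remoteTime, $localPath) {
    if ($remoteTime === -1) {
        return true; // no Last-Modified header: safest to re-download
    }
    if (!file_exists($localPath)) {
        return true; // we don't have the image yet
    }
    return filemtime($localPath) < $remoteTime; // remote copy is newer
}
```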
I made a PHP console script that issues a curl multi request using this library: ParallelCurl - github.
My PHP code is:
function setComparatorData( $model, $filetime ) {
    global $comparator;

    // If our local copy is already newer than the remote Last-Modified
    // time, there is nothing to do. Note: $filetime is -1 when the server
    // did not send a Last-Modified header.
    if ( file_exists(DIR_IMAGE . "catalog/" . $model . ".jpg") ) {
        $localFileTime = filemtime(DIR_IMAGE . "catalog/" . $model . ".jpg");
        if ( $localFileTime > $filetime ) return;
    }

    // Remember this model as needing a fresh download.
    $comparator[$model] = $filetime;
}
// Callback invoked by ParallelCurl when a request completes.
function onReceived($content, $url, $ch, $request) {
    $data = curl_getinfo($ch); // 'filetime' holds the parsed Last-Modified time
    setComparatorData($request['model'], $data['filetime']);
}
function request($limit = 100) {
    $products = array(); // This is the array of products from the database

    $curl_options = array(
        CURLOPT_SSL_VERIFYPEER => FALSE,
        CURLOPT_SSL_VERIFYHOST => FALSE,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
        CURLOPT_NOBODY         => TRUE, // HEAD-style request, no body transfer
        CURLOPT_FOLLOWLOCATION => TRUE,
        CURLOPT_HEADER         => TRUE,
        CURLOPT_FILETIME       => TRUE, // expose Last-Modified via curl_getinfo()
        CURLOPT_TIMEOUT        => 5,
        CURLOPT_FRESH_CONNECT  => TRUE  // For test!
    );

    $parallel_curl = new ParallelCurl($limit, $curl_options);

    foreach ($products as $product) {
        $parallel_curl->startRequest("http://supplierImageUrlBasedOnProductVariable", 'onReceived', array("model" => $product['model'], "source" => "remote"));
    }

    $parallel_curl->finishAllRequests();
}
$comparator = array();
request(100);
print_r($comparator);
This splits the multi-request into groups of 100 parallel requests; after one group finishes, the next one starts. My problem is that this is slow as hell. 600 requests (product images) took 8 seconds, but 5000 had been running for half an hour when I stopped it.
I believe the biggest problem is PHP, but maybe I'm wrong. Does anyone have an idea how to solve this speed issue? Should I rewrite it as a Python or Bash script? Would that help? Or is there a small mistake in the code that is causing the slow responses?
Maybe my whole approach is wrong; if anyone has a better idea, please write it down.
Thank you