
I have the following problem with getting images as an array. In this code I'm trying to check whether images exist for the search term Test 1: if yes, display them; if not, try Test 2 and stop there. The current code does this, but it is very slow.

The check `if (sizeof($matches[1]) > 3) {` is there because the first 3 results on the crawled website are sometimes advertisements, so this is my safeguard for skipping them.

My question is: how can I speed up the code below so that the `if (sizeof($matches[1]) > 3) {` check is reached faster? I believe this is what makes the code so slow, because the array may contain up to 1000 images.

$get_search = 'Test 1';
$ch_foreach = 0;

$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

if (sizeof($matches[1]) > 3) {
  $ch_foreach = 1;
}

if ($ch_foreach == 0) {

  $get_search = 'Test 2';

  $html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
  preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

  if (sizeof($matches[1]) > 3) {
    $ch_foreach = 1;
  }

}

$tmp = 0;
foreach ($matches[1] as $match) if ($tmp++ < 20) {

  if (@getimagesize($match)) {

    // display image
    echo $match;

  }

}
  • Possible duplicate of [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – miken32 Apr 20 '19 at 05:01
  • You're very wrong: checking the size of the array is not the performance issue here, nor is `$ch_foreach = 1;` a performance issue. The potentially slow parts of this code are the HTTP fetch and the regex execution. To optimize the HTTP fetch, switch to curl and use CURLOPT_ENCODING (or, even faster, implement a local cache with an updating daemon), and to optimize the HTML parsing, switch to DOMDocument & DOMXPath instead of parsing with regex. – hanshenrik Apr 20 '19 at 07:36

1 Answer

$html = file_get_contents('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');

Unless the www.everypixel.com server is on the same LAN (in which case the compression overhead may be slower than transferring the page uncompressed), curl with CURLOPT_ENCODING should fetch this faster than file_get_contents. Even on the same LAN, curl should still be faster, because file_get_contents keeps reading until the server closes the connection, while curl stops reading once Content-Length bytes have been received, which is faster than waiting for the server to close the socket. So do this instead:

$ch = curl_init('https://www.everypixel.com/search?q='.$get_search.'&is_id=1&st=free');
// accept any compression the server supports, and return the response instead of printing it
curl_setopt_array($ch, array(CURLOPT_ENCODING => '', CURLOPT_RETURNTRANSFER => 1));
$html = curl_exec($ch);
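(If the request can fail, e.g. timeouts or DNS problems, note that with CURLOPT_RETURNTRANSFER set, curl_exec() returns false on failure and curl_error($ch) tells you why, so it's worth checking $html before parsing it.)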

About your regex:

preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

DOMDocument with getElementsByTagName("img") and getAttribute("src") should be faster than using your regex, so do this instead:

$domd = new DOMDocument();
@$domd->loadHTML($html); // @ suppresses warnings about malformed HTML
$urls = [];
foreach ($domd->getElementsByTagName("img") as $img) {
    $url = $img->getAttribute("src");
    if (!empty($url)) {
        $urls[] = $url;
    }
}
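If you wrap the curl fetch and the DOM parsing together, you can also keep your Test 1 / Test 2 fallback. A minimal sketch (fetch_image_urls is just a name I'm making up here, and the "more than 3 results or it's probably only ads" heuristic is taken straight from your question):

// hypothetical helper combining the curl fetch and DOMDocument parsing shown above
function fetch_image_urls($get_search) {
    $ch = curl_init('https://www.everypixel.com/search?q='.urlencode($get_search).'&is_id=1&st=free');
    curl_setopt_array($ch, array(CURLOPT_ENCODING => '', CURLOPT_RETURNTRANSFER => 1));
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        return [];
    }
    $domd = new DOMDocument();
    @$domd->loadHTML($html);
    $urls = [];
    foreach ($domd->getElementsByTagName("img") as $img) {
        $url = $img->getAttribute("src");
        if (!empty($url)) {
            $urls[] = $url;
        }
    }
    return $urls;
}

$urls = fetch_image_urls('Test 1');
if (count($urls) <= 3) { // 3 or fewer hits: probably just the ads, so fall back to Test 2
    $urls = fetch_image_urls('Test 2');
}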

And probably the slowest part of your entire code: the @getimagesize($match) inside a loop potentially containing over 1000 URLs. Every call to getimagesize() with a URL makes PHP download the image, and it uses the file_get_contents method, meaning it suffers from the same Content-Length issue that makes file_get_contents slow. In addition, all the images are downloaded sequentially; downloading them in parallel should be much faster. That can be done with the curl_multi API, but doing it properly is a complex task, so rather than a full implementation I'll point you to an example: https://stackoverflow.com/a/54717579/1067003
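Just to show the shape of it, here is a very rough curl_multi sketch of my own (not the linked answer): no error handling, no concurrency limit, only the first 20 URLs as in your original loop, and it checks the downloaded bytes with getimagesizefromstring() instead of calling getimagesize() on a URL:

$check = array_slice($urls, 0, 20); // only the first 20, like the $tmp++ < 20 in the question

$mh = curl_multi_init();
$handles = array();
foreach ($check as $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(CURLOPT_RETURNTRANSFER => 1, CURLOPT_ENCODING => '', CURLOPT_FOLLOWLOCATION => 1));
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// run all downloads in parallel
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);
    // same "is this really an image?" check as getimagesize(), but on bytes we already have
    if (!empty($body) && @getimagesizefromstring($body)) {
        echo $url;
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);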

hanshenrik