
I am trying to simulate a function like http://pinterest.com's "add a pin".

How can I get all the images from a URL whose width and height are both >= 200, more quickly? pinterest.com finishes the whole process in nearly 10 seconds, but my version needs 48.64 seconds.

require dirname(__FILE__) . '/simple_html_dom.php';
$url = 'http://www.huffingtonpost.com/';
$html = file_get_html($url);
if($html->find('img')){
    foreach($html->find('img') as $element) {
        $size = @getimagesize($element->src);
        if($size[0]>=200&&$size[1]>=200){
            echo $element;
        }
    }
}// cost 48.64 seconds
fish man

4 Answers


I think what you should do is run the cURL requests in parallel using curl_multi_init; please see http://php.net/manual/en/function.curl-multi-init.php for more information. This way the downloads run much faster, and you avoid the bandwidth issues that can also affect speed.

Save each image into a local temp directory, then run getimagesize() on the local file; that is much faster than running it over http://.

I hope this helps

Edit 1

Note:

A. Not all image URLs start with http

B. Not all images are valid

C. Create a temp folder where the images will be stored

Proof of Concept

require 'simple_html_dom.php';
$url = 'http://www.huffingtonpost.com';
$html = file_get_html ( $url );
$nodes = array ();
$start = microtime ();
$res = array ();

if ($html->find ( 'img' )) {
    foreach ( $html->find ( 'img' ) as $element ) {
        if (startsWith ( $element->src, "/" )) {
            $element->src = $url . $element->src;
        }
        if (! startsWith ( $element->src, "http" )) {
            $element->src = $url . "/" . $element->src;
        }
        $nodes [] = $element->src;
    }
}

echo "<pre>";
print_r ( imageDownload ( $nodes, 200, 200 ) );
echo "<h1>", microtime () - $start, "</h1>";

function imageDownload($nodes, $maxHeight = 0, $maxWidth = 0) {

    $mh = curl_multi_init ();
    $curl_array = array ();
    foreach ( $nodes as $i => $url ) {
        $curl_array [$i] = curl_init ( $url );
        curl_setopt ( $curl_array [$i], CURLOPT_RETURNTRANSFER, true );
        curl_setopt ( $curl_array [$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)' );
        curl_setopt ( $curl_array [$i], CURLOPT_CONNECTTIMEOUT, 5 );
        curl_setopt ( $curl_array [$i], CURLOPT_TIMEOUT, 15 );
        curl_multi_add_handle ( $mh, $curl_array [$i] );
    }
    $running = NULL;
    do {
        usleep ( 10000 );
        curl_multi_exec ( $mh, $running );
    } while ( $running > 0 );

    $res = array ();
    foreach ( $nodes as $i => $url ) {
        $curlErrorCode = curl_errno ( $curl_array [$i] );

        if ($curlErrorCode === 0) {
            $info = curl_getinfo ( $curl_array [$i] );
            if ($info ['content_type'] !== null) {
                $ext = getExtention ( $info ['content_type'] );
                $temp = "temp/img" . md5 ( mt_rand () ) . $ext;
                touch ( $temp );
                $imageContent = curl_multi_getcontent ( $curl_array [$i] );
                file_put_contents ( $temp, $imageContent );
                if ($maxHeight == 0 || $maxWidth == 0) {
                    $res [] = $temp;
                } else {
                    $size = getimagesize ( $temp );
                    if ($size [1] >= $maxHeight && $size [0] >= $maxWidth) {
                        $res [] = $temp;
                    } else {
                        unlink ( $temp );
                    }
                }
            }
        }
        curl_multi_remove_handle ( $mh, $curl_array [$i] );
        curl_close ( $curl_array [$i] );

    }

    curl_multi_close ( $mh );
    return $res;
}

function getExtention($type) {
    $type = strtolower ( $type );
    switch ($type) {
        case "image/gif" :
            return ".gif";
            break;
        case "image/png" :
            return ".png";
            break;

        case "image/jpeg" :
            return ".jpg";
            break;

        default :
            return ".img";
            break;
    }
}

function startsWith($str, $prefix) {
    $temp = substr ( $str, 0, strlen ( $prefix ) );
    $temp = strtolower ( $temp );
    $prefix = strtolower ( $prefix );
    return ($temp == $prefix);
}

Output

Array
(
    [0] => temp/img8cdd64d686ee6b925e8706fa35968da4.gif
    [1] => temp/img5811155f8862cd0c3e2746881df9cd9f.gif
    [2] => temp/imga597bf04873859a69373804dc2e2c27e.jpg
    [3] => temp/img0914451e7e5a6f4c883ad7845569029e.jpg
    [4] => temp/imgb1c8c4fa88d0847c99c6f4aa17a0a457.jpg
    [5] => temp/img36e5da68a30df7934a26911f65230819.jpg
    [6] => temp/img068c1aa705296b38f2ec689e5b3172b9.png
    [7] => temp/imgfbeca2410b9a9fb5c08ef88dacd46895.png
)
0.076347

Thanks :)

Baba
  • that is a great method, many thanks. one problem: how do I get the raw image's URL, not the path in the local temp folder? – fish man Apr 06 '12 at 07:47
  • That is easy: replace `$res [] = $temp;` with `$res [] = $url;` and that would do the trick. Don't forget to also `unlink` everything – Baba Apr 06 '12 at 10:14
  • @Baba Please modify this condition `if ($size [0] >= $maxHeight && $size [0] >= $maxWidth)`. You might want to change that $size[0] to $size[1] for the $maxHeight comparison – Rajasekhar Mar 21 '15 at 05:39
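The change Baba describes in the comment above could be sketched as a small helper (a hypothetical function name, not part of the answer's code; `$minW`/`$minH` stand in for the answer's `$maxWidth`/`$maxHeight`):

```php
// Hypothetical helper sketching Baba's suggestion: check the downloaded
// temp copy, delete it either way, and return the original URL on success.
function keepUrlIfLargeEnough($temp, $url, $minW, $minH) {
    $size = @getimagesize($temp);   // read dimensions from the local copy
    unlink($temp);                  // the temp file is no longer needed
    if ($size !== false && $size[0] >= $minW && $size[1] >= $minH) {
        return $url;                // caller stores the remote URL
    }
    return null;
}
```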

getimagesize() will download the ENTIRE image file first, then do the analysis. Generally you only need the first couple hundred bytes of the file to get type/resolution details. Plus, it uses a separate HTTP request for each image.

A properly optimized system would use partial GET requests to fetch only the first chunk of each image, and take advantage of HTTP keep-alives to keep TCP connection overhead to a minimum.
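A minimal sketch of that idea, assuming the server honours `Range` headers and PHP >= 5.4 for `getimagesizefromstring()` (the function name `probeImageSize` is made up for illustration):

```php
// Sketch: fetch only the first bytes of an image with an HTTP Range request,
// then read the dimensions from that partial payload.
function probeImageSize($url, $bytes = 32768) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, array("Range: bytes=0-" . ($bytes - 1)));
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 5);
    $head = curl_exec($curl);
    curl_close($curl);
    if ($head === false) {
        return false;
    }
    // For JPEG/PNG/GIF the dimensions sit in the file header, so the
    // truncated payload is usually enough for getimagesizefromstring().
    return @getimagesizefromstring($head);
}
```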

Marc B
  • 1
    Partial gets are defined here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35 basically just a normal request, but with a `Range:` header to specify which bytes you want transferred. You can use curl to do persistent http requests: http://php.net/curl – Marc B Apr 05 '12 at 21:08

Reference

Use imagecreatefromstring, imagesx and imagesy. This should run in about 30 seconds, a bit faster than getimagesize().

function ranger($url){
    $headers = array( "Range: bytes=0-32768" );
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $data = curl_exec($curl);
    curl_close($curl); // close before returning; the original returned first, so the handle was never closed
    return $data;
}
require dirname(__FILE__) . '/simple_html_dom.php';
$url = 'http://www.huffingtonpost.com/';
$html = file_get_html($url);
if($html->find('img')){
    foreach($html->find('img') as $element) {
        $raw = ranger($element->src);
        $im = @imagecreatefromstring($raw);
        $width = @imagesx($im);
        $height = @imagesy($im);
        if($width>=200&&$height>=200){
            echo $element;
        }
    }
}
yuli chika

What about reading the width and height attributes from the HTML? I know some of the images may not have these attributes, but maybe you can just skip images whose declared attributes are smaller than 200px.

It is just an idea for a workaround, so it may not be usable for you.
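This idea could be sketched with PHP's built-in DOMDocument (used here instead of simple_html_dom only to keep the example self-contained; the function name is made up). Images that declare no size still need a real check, so they are returned separately:

```php
// Sketch: pre-filter <img> tags by their declared width/height attributes.
// Returns array(images declared >= $min in both dimensions,
//               images with no declared size, to be checked for real).
function splitByDeclaredSize($html, $min = 200) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // silence warnings from sloppy markup
    $big = array();
    $unknown = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $w = $img->getAttribute('width');
        $h = $img->getAttribute('height');
        if ($w === '' || $h === '') {
            $unknown[] = $img->getAttribute('src');   // no declared size
        } elseif ((int) $w >= $min && (int) $h >= $min) {
            $big[] = $img->getAttribute('src');
        }
    }
    return array($big, $unknown);
}
```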

Juraj.Lorinc