3

I am working on a real estate website and we're about to get an external feed of ~1M listings. Assuming each listing has ~10 photos associated with it, that's about ~10M photos, and we're required to download each of them to our server so as to not "hot link" to them.

I'm at a complete loss as to how to do this efficiently. I played with some numbers and concluded that, at a serial rate of 0.5 seconds per image, ~10M images works out to roughly 5,000,000 seconds, or upwards of ~58 days, to download from the external server. That is obviously unacceptable.

Each photo seems to be roughly ~50KB, but that can vary, with some being larger (sometimes much larger) and some being smaller.

I've been testing by simply using:

copy('http://www.external-site.com/image1.jpg', '/path/to/folder/image1.jpg');

I've also tried cURL, wget, and others.

I know other sites do it, and at a much larger scale, but I haven't the slightest clue how they manage this sort of thing without it taking months at a time.

Sample structure of the XML feed we're set to receive. We're parsing the XML using PHP:

<listing>
    <listing_id>12345</listing_id>
    <listing_photos>
        <photo>http://example.com/photo1.jpg</photo>
        <photo>http://example.com/photo2.jpg</photo>
        <photo>http://example.com/photo3.jpg</photo>
        <photo>http://example.com/photo4.jpg</photo>
        <photo>http://example.com/photo5.jpg</photo>
        <photo>http://example.com/photo6.jpg</photo>
        <photo>http://example.com/photo7.jpg</photo>
        <photo>http://example.com/photo8.jpg</photo>
        <photo>http://example.com/photo9.jpg</photo>
        <photo>http://example.com/photo10.jpg</photo>
    </listing_photos>
</listing>

So my script will iterate through each photo for a specific listing, download it to our server, and insert the photo name into our photo database (the insert part is already working without issue).
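For reference, here's a minimal sketch of the serial approach I've been testing. The feed's root element, the target folder, and insertPhotoRecord() are placeholders standing in for our real feed layout and existing DB code:

<?php
// Minimal sketch of the current serial approach (placeholder paths and names).
$feed = simplexml_load_file('/path/to/feed.xml'); // hypothetical feed location

foreach ($feed->listing as $listing) {
    $listingId = (string) $listing->listing_id;

    foreach ($listing->listing_photos->photo as $photoUrl) {
        $photoUrl  = (string) $photoUrl;
        $fileName  = $listingId . '_' . basename(parse_url($photoUrl, PHP_URL_PATH));
        $localPath = '/path/to/folder/' . $fileName;

        // One blocking download at a time -- this is the bottleneck.
        if (copy($photoUrl, $localPath)) {
            // insertPhotoRecord() stands in for our existing, working DB insert.
            insertPhotoRecord($listingId, $fileName);
        }
    }
}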

Any thoughts?

mferly
  • 1,646
  • 1
  • 13
  • 19
  • Download them on demand? – Jay Blanchard Jan 16 '15 at 19:07
  • Preferably at the time of initial parsing (of the XML feed file). But other ideas are more than welcome. – mferly Jan 16 '15 at 19:09
  • The downloads do not have to be done serially – you could fetch all 10 simultaneously, which could significantly reduce download time. – tomfumb Jan 16 '15 at 19:15
  • Thanks @tomfumb Using PHP, do you have any recommendations on how to accomplish that? – mferly Jan 16 '15 at 19:26
  • 1
    Use wget in multiple background tasks, generating the source and destination details from the data you already have. The total time is decided by the speed of your internet connection and the processing power of your server, i.e. how many wget tasks you can run simultaneously. – Ryan Vincent Jan 16 '15 at 19:37
  • 3
    You might also want to ask the current image host how many requests they can handle. You wouldn't want to flood them with 1000 requests/second without letting them know beforehand. – BrokenBinary Jan 16 '15 at 19:46
  • As should be clear, getting the millions of images in a reasonable time is not the real issue. You will need to download about one terabyte of data, assuming 100KB per image, so you need to talk to the provider of the images about what request rate is acceptable. – Ryan Vincent Jan 16 '15 at 20:10
  • Great thoughts people. And good thinking on notifying them beforehand. – mferly Jan 16 '15 at 20:56

3 Answers

2

Before you do this

As @BrokenBinary said in the comments, take into account how many requests per second the host can handle. You don't want to flood them with requests without them knowing. Then use something like sleep to keep your request rate under whatever limit they give you.
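A minimal sketch of that kind of throttling (the 10-requests-per-second cap and $photoUrls are made-up placeholders; use whatever rate the host agrees to):

<?php
// Hypothetical throttle: cap downloads at $maxPerSecond requests per second.
$maxPerSecond = 10; // placeholder -- use the rate the image host tells you

foreach ($photoUrls as $i => $url) {
    copy($url, '/path/to/folder/' . basename($url));

    // After each batch of $maxPerSecond downloads, pause for one second.
    if (($i + 1) % $maxPerSecond === 0) {
        sleep(1);
    }
}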

Curl Multi

Anyway, use cURL's multi interface. This is somewhat of a duplicate answer, but copied here anyway:

$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// Create one easy handle per URL and add it to the multi handle.
for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers in parallel, waiting on socket activity instead of busy-looping.
do {
    curl_multi_exec($master, $running);
    if ($running) {
        curl_multi_select($master);
    }
} while ($running > 0);

// Collect the responses and clean up the handles.
$results = array();
for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
    curl_multi_remove_handle($master, $curl_arr[$i]);
    curl_close($curl_arr[$i]);
}
curl_multi_close($master);

print_r($results);

From: PHP Parallel curl requests
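To adapt this to your image downloads, a rough sketch (the batch size, $photoUrls, and the target directory are assumptions) would write each response straight to disk with CURLOPT_FILE instead of buffering it in memory, and process the URL list in batches:

<?php
// Rough sketch: download $photoUrls in parallel batches, streaming each to disk.
$batchSize = 20; // placeholder -- tune to what the host and your server can handle

foreach (array_chunk($photoUrls, $batchSize) as $batch) {
    $master  = curl_multi_init();
    $handles = array();
    $files   = array();

    foreach ($batch as $url) {
        $fp = fopen('/path/to/folder/' . basename($url), 'wb');
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the body to the file
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
        curl_multi_add_handle($master, $ch);
        $handles[] = $ch;
        $files[]   = $fp;
    }

    // Let all transfers in the batch run concurrently.
    do {
        curl_multi_exec($master, $running);
        if ($running) {
            curl_multi_select($master);
        }
    } while ($running > 0);

    // Clean up handles and close the output files.
    foreach ($handles as $i => $ch) {
        curl_multi_remove_handle($master, $ch);
        curl_close($ch);
        fclose($files[$i]);
    }
    curl_multi_close($master);
}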

Another solution:

pthreads

<?php

class WebRequest extends Stackable {
    public $request_url;
    public $response_body;

    public function __construct($request_url) {
        $this->request_url = $request_url;
    }

    public function run(){
        $this->response_body = file_get_contents(
            $this->request_url);
    }
}

class WebWorker extends Worker {
    public function run(){}
}

$list = array(
    new WebRequest("http://google.com"),
    new WebRequest("http://www.php.net")
);

$max = 8;
$threads = array();
$start = microtime(true);

/* start some workers */
while (@$thread++<$max) {
    $threads[$thread] = new WebWorker();
    $threads[$thread]->start();
}

/* stack the jobs onto workers */
foreach ($list as $job) {
    $threads[array_rand($threads)]->stack(
        $job);
}

/* wait for completion */
foreach ($threads as $thread) {
    $thread->shutdown();
}

$time = microtime(true) - $start;

/* tell you all about it */
printf("Fetched %d responses in %.3f seconds\n", count($list), $time);
$length = 0;
foreach ($list as $listed) {
    $length += strlen($listed["response_body"]);
}
printf("Total of %d bytes\n", $length);
?>

Source: PHP testing between pthreads and curl

You should really use the search feature, ya know :)

Tek
  • 2,888
  • 5
  • 45
  • 73
  • Excellent. Thanks Tek. I just gave this a try and it seems to have cut the processing time in half.. even greater than half. But my process seems to be pegged at ~13 photos per second. Won't go above that. Is there any way to really "open this up" and grab more images per second? Or is that really just down to the hardware, do you think? – mferly Jan 16 '15 at 21:10
  • @Marcus Try the other solution I posted – Tek Jan 16 '15 at 21:39
  • I know.. but I did do a search! lol Seems you're much better than I am for that. The above code appears to be part of a larger class. Any thoughts on what it's doing exactly? – mferly Jan 16 '15 at 22:36
  • @Marcus There's not many suggestions I can make without personally knowing what kind of environment you're running. I'd give it a try. And ah yeah, I've gotten that a couple of times. Good thing that only happens once in a blue moon :P – Tek Jan 16 '15 at 22:37
2

You can save all of the links into a database table (it will be your "job queue"). Then you can create a script which, in a loop, gets a job and does it (fetches the image for a single link and marks the job record as done). You can execute the script multiple times, e.g. using supervisord, so the job queue will be processed in parallel. If it's too slow, you can just start another worker script (as long as bandwidth doesn't slow you down).

If any script hangs for some reason, you can easily run it again to fetch only the images that haven't been downloaded yet. By the way, supervisord can be configured to automatically restart each script if it fails.

Another advantage is that at any time you can check the output of those scripts with supervisorctl. To see how many images are still waiting, you can simply query the "job queue" table. A sketch of such a worker is shown below.
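A minimal worker sketch, assuming a hypothetical photo_jobs table with id, url, local_path and status columns (the table, column names and connection details are made up for illustration):

<?php
// Worker sketch: repeatedly claim a pending job, download the image, mark it done.
$pdo = new PDO('mysql:host=localhost;dbname=realestate', 'user', 'pass');

while (true) {
    // Claim one pending job. A real implementation should lock the row
    // (e.g. SELECT ... FOR UPDATE) so two workers never grab the same job.
    $job = $pdo->query(
        "SELECT id, url, local_path FROM photo_jobs WHERE status = 'pending' LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);

    if (!$job) {
        break; // queue is empty
    }

    $status = copy($job['url'], $job['local_path']) ? 'done' : 'failed';

    $update = $pdo->prepare("UPDATE photo_jobs SET status = :status WHERE id = :id");
    $update->execute(array('status' => $status, 'id' => $job['id']));
}

Run several copies of this script under supervisord (for example via its numprocs setting) and the queue is processed in parallel; failed jobs can simply be reset to pending and picked up again.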

Jakub Filipczyk
  • 1,141
  • 8
  • 18
2

I am surprised the vendor is not allowing you to hot-link. The truth is you will not serve every image every month, so why download every image? Allowing you to hot-link would be a better use of everyone's bandwidth.

I manage a catalog with millions of items where the data is local but the images are mostly hot linked. Sometimes we need to hide the source of the image or the vendor requires us to cache the image. To accomplish both goals we use a proxy. We wrote our own proxy but you might find something open source that would meet your needs.

The way the proxy works is that we encrypt the image URL and then URL-encode the encrypted string. So http://yourvendor.com/img1.jpg becomes xtX957z. In our markup, the img src attribute is something like http://ourproxy.com/getImage.ashx?image=xtX957z.

When our proxy receives an image request, it decrypts the image URL. The proxy first looks on disk for the image. We derive the image name from the URL, so it is looking for something like yourvendorcom.img1.jpg. If the proxy cannot find the image on disk, then it uses the decrypted URL to fetch the image from the vendor. It then writes the image to disk and serves it back to the client. This approach has the advantage of being on demand with no wasted bandwidth. I only get the images I need and I only get them once.
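In PHP terms, the cache-or-fetch logic might look roughly like this (the decryptImageUrl() helper and the cache directory are hypothetical placeholders; the answer's actual proxy is a custom handler, not this code):

<?php
// getImage.php?image=xtX957z -- sketch of an on-demand caching image proxy.
$token     = $_GET['image'];
$remoteUrl = decryptImageUrl($token);   // hypothetical helper, e.g. returns http://yourvendor.com/img1.jpg

// Derive a local file name from the remote URL, e.g. yourvendorcom.img1.jpg.
$host      = str_replace('.', '', parse_url($remoteUrl, PHP_URL_HOST));
$cachePath = '/path/to/cache/' . $host . '.' . basename($remoteUrl);

// If the image isn't cached yet, fetch it from the vendor once and store it.
if (!file_exists($cachePath)) {
    copy($remoteUrl, $cachePath);
}

// Serve the cached copy back to the client.
header('Content-Type: image/jpeg');
readfile($cachePath);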

kheld
  • 782
  • 5
  • 14
  • I agree. But the vendor syndicates their data to about 20 other sites, so I believe they're looking to conserve their own bandwidth in fear that some images are viewed hundreds, even thousands of times a day. And I like your idea there. I'm going to investigate that method further! – mferly Jan 16 '15 at 22:32
  • This seems to be the best solution after countless hours of investigating. Seems rather pointless to download ALL images when some, probably most, won't even be viewed by a user.. at least not for a long time. On-demand makes the most sense. Cheers. – mferly Jan 19 '15 at 15:22