3

I have around 600k image URLs in different tables, and I am downloading all the images with the code below; it is working fine. (I know FTP is the best option, but somehow I can't use it.)

$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];

    try {
        copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
    } catch(Exception $e) {
        echo "<br/>\n unable to copy '$fileName'. Error:$e";
    }
}

Problems are:

  1. After some time, say 10 minutes, the script gives a 503 error but still continues downloading the images. Why does it not stop copying?
  2. It does not download all the images; every time there is a difference of 100 to 150 images. How can I trace which images were not downloaded?

I hope I have explained well.

Suresh Kamrushi
  • Is there a possibility of using `rsync`? – u_mulder Dec 09 '13 at 06:44
  • This library is not related to what I am looking for. – Suresh Kamrushi Dec 09 '13 at 06:46
  • Restating the question in case I have not made it clear: 1) even after getting a 503, it still continues downloading the images; how? 2) How can I trace which images were not downloaded? – Suresh Kamrushi Dec 13 '13 at 05:44
  • @SureshKamrushi I'd like to add a debugging angle to this. I'd suggest: read table entry -> add to temp table and copy file -> repeat. In this scenario you might find you have to run it a few times, but it will overcome some timeout/connection issues, for example if the server is dropping you for too many connections. This copies the files and keeps a record of them, so the next run/refresh will only copy new files. Like I said, in case it is a connection issue, it is worth a try if you haven't already thought of it. – Bradmage Dec 13 '13 at 12:57
  • @SureshKamrushi check my answer. Don't download all of those images in one go. Download them in batches. – Gogol Dec 14 '13 at 19:24
  • Are you running this with `cli`? If from the browser, you should output at least something after each `copy()` is done, so that it stops downloading when you see a `503`. – Ja͢ck Dec 16 '13 at 04:07

6 Answers

3

First of all, copy() will not throw any exception, so you are not doing any error handling. That is why your script just keeps running.

Second, you should use file_get_contents(), or even better, cURL.

For example, you could try this function. (I know it opens and closes cURL every time; it is just an example I found here: https://stackoverflow.com/a/6307010/1164866)

function getimg($url) {
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}
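For instance, a minimal usage sketch; the URL and target path here are just placeholders, not anything from the question:

// illustrative only: URL and target filename are placeholders
$img = getimg('http://www.example.com/images/photo.jpg');
if ($img !== false) {
    file_put_contents('img/photo.jpg', $img);
} else {
    echo "download failed\n";
}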

Or even better, try curl_multi_exec and get your files downloaded in parallel, which will be a lot faster.

Take a look here:

http://www.php.net/manual/en/function.curl-multi-exec.php
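A rough sketch of how the multi interface could be wired up for this; the URLs, batch size and file naming are placeholders, and real code would add per-URL error handling:

// sketch: download a small batch of URLs in parallel with curl_multi
$urls = array(
    'http://www.example.com/images/a.jpg',
    'http://www.example.com/images/b.jpg',
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// run all transfers until they have finished
$running = null;
do {
    curl_multi_exec($mh, $running);
    // wait for activity to avoid busy-looping
    curl_multi_select($mh);
} while ($running > 0);

// collect the results and clean up
foreach ($handles as $i => $ch) {
    if (curl_errno($ch) === 0) {
        file_put_contents('img/' . basename($urls[$i]), curl_multi_getcontent($ch));
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);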

Edit:

To track which files failed to download, you need to do something like this:

$queryRes = mysql_query("select url from tablName limit 50000"); //everytime i am using limit
while($row = mysql_fetch_object($queryRes)) {

    $info = pathinfo($row->url);    
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];    

    if (!@copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
       $errors= error_get_last();
       echo "COPY ERROR: ".$errors['type'];
       echo "<br />\n".$errors['message'];
       //you can add what ever code you wnat here... out put to conselo, log in a file put an exit() to stop dowloading... 
    }
}

more info: http://www.php.net/manual/es/function.copy.php#83955

Javier Neyra
  • @Javier: Thanks for your answer. But what you have suggested is an alternative (might be better) for copying images, which is not what I am looking for. Please see my comments on the question. – Suresh Kamrushi Dec 13 '13 at 05:47
  • @SureshKamrushi take a look at the edit; that is what you need to check which files didn't download. – Javier Neyra Dec 13 '13 at 17:22
1

It is better handled batch-by-batch.

Table structure

CREATE TABLE IF NOT EXISTS `images` (
  `id` int(60) NOT NULL AUTO_INCREMENT,
  `link` varchar(1024) NOT NULL,
  `status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);

The script

<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost
   without cron job support. Just keep the browser open and the script runs
   by itself (JavaScript is needed). */
$reload = false;
// to prevent PHP timeout
set_time_limit(0);

// db connection (you need PDO enabled)
try {
    $host = 'localhost';
    $dbname = 'mydbname';
    $user = 'root';
    $pass = '';
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
} catch (PDOException $e) {
    echo $e->getMessage();
}
$DBH->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if (empty($files)) {
    echo 'All files have been fetched!!!';
    die();
}
// where to save the images?
$savepath = dirname(__FILE__).'/scrapped/';
// fetch 'em!
foreach ($files as $file) {
    // get_url_content uses cURL. Function defined later on.
    $content = get_url_content($file['link']);
    // get the file name from the url. You can use a random name too.
    $url_parts_array = explode('/', $file['link']);
    /* assuming the image url is http://abc.com/images/myimage.png, if we explode
       the string by /, the last element of the exploded array is the filename */
    $filename = $url_parts_array[count($url_parts_array) - 1];
    // save the fetched image
    file_put_contents($savepath.$filename, $content);
    // did the image save?
    if (file_exists($savepath.$filename)) {
        // yes? Okay, let's save the status
        $query = $DBH->prepare("UPDATE images SET status = 'fetched' WHERE id = ".$file['id']);
        // output the name of the file that just got downloaded
        echo $file['link']; echo '<br/>';
        $query->execute();
    }
}

// function definition: get_url_content()
function get_url_content($url) {
    // ummm let's make our bot look like a human
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}

// reload enabled? Reload!
if ($reload)
    echo '<script>location.reload(true);</script>';
Gogol
  • Not useful as written now, others have already pointed out that splitting the huge batch is necessary. Too much self-promo as the article could as well be posted here, it is short enough. Also it resembles [Javier Neyra’s answer](http://stackoverflow.com/a/20558503/2157640), just expanded. IMO commenting on that answer should be enough. – Palec Dec 15 '13 at 11:47
  • That was when I was a n00b on SO :p I have updated the answer. – Gogol Jul 04 '17 at 11:31
0

I haven't used copy() myself; I'd use file_get_contents(), which works fine with remote servers.

Edit:

It also returns false on failure, so:

if (false === file_get_contents(...))
    trigger_error(...);
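For completeness, a slightly fuller sketch of the same idea; the URL and target path are placeholders:

// illustrative only: URL and target filename are placeholders
$url = 'http://www.example.com/images/photo.jpg';
$data = file_get_contents($url);
if ($data === false) {
    trigger_error("could not download $url", E_USER_WARNING);
} else {
    file_put_contents('img/photo.jpg', $data);
}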
Bradmage
0
  1. I think 50000 is too large. Network transfers always take time; downloading one image might cost over 100 ms (depending on your network conditions), so 50000 images, even in the most stable case (without timeouts or other errors), might cost 50000*100/1000/60 = 83 minutes. That is a really long time for a PHP script. If you run this script as CGI (not CLI), you normally only get 30 seconds by default (without set_time_limit). So I recommend making this script a cron job and running it every 10 seconds to fetch about 50 URLs.

  2. To make the script only fetch a few images each time, you must remember which ones have already been processed (successfully). For example, you can add a flag column to the url table: by default flag = 1; if a url is processed successfully, it becomes 2; otherwise it becomes 3, which means something went wrong with that url. Each time, the script should only select rows with flag = 1 (3 might also be included, but sometimes the url is so wrong that retrying won't help).

  3. The copy function is too simple; I recommend using cURL instead. It is more reliable, and you can get exact network information about the download.

Here is the code:

// only fetch 50 urls each time
$queryRes = mysql_query("SELECT id, url FROM tablName WHERE flag = 1 LIMIT 50");

// just prefer an absolute path
$imgDirPath = dirname(__FILE__) . '/';

while ($row = mysql_fetch_object($queryRes))
{
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];

    // url in the table is like //www.example.com???
    $result = fetchUrl("http:" . $row->url,
            $imgDirPath . "img/$fileName" . "_" . $row->id . "." . $fileExtension);

    if ($result !== true)
    {
        echo "<br/>\n unable to copy '$fileName'. Error:$result";
        // update flag to 3, finish this func yourself
        set_row_flag(3, $row->id);
    }
    else
    {
        // update flag to 2
        set_row_flag(2, $row->id);
    }
}

function fetchUrl($url, $saveto)
{
    $ch = curl_init($url);

    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 7);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);

    $raw = curl_exec($ch);

    $error = false;

    if (curl_errno($ch))
    {
        $error = curl_error($ch);
    }
    else
    {
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        if ($httpCode != 200)
        {
            $error = 'HTTP code not 200: ' . $httpCode;
        }
    }

    curl_close($ch);

    if ($error)
    {
        return $error;
    }

    file_put_contents($saveto, $raw);

    return true;
}
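set_row_flag() is left for you to finish; a minimal sketch, assuming the flag column described in point 2 and the same deprecated mysql_* API as the rest of the code, could look like this:

// minimal sketch of set_row_flag(); assumes tablName has the `flag` column described above
function set_row_flag($flag, $id)
{
    mysql_query('UPDATE tablName SET flag = ' . (int) $flag . ' WHERE id = ' . (int) $id);
}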
Andrew
  • Thanks for your answer. But what you have suggested is an alternative (might be better) for copying images, which is not what I am looking for. Please see my comments on the question. – Suresh Kamrushi Dec 13 '13 at 05:46
0
  1. Strict checking of the mysql_fetch_object() return value is IMO better, as many similar functions may return a non-boolean value that evaluates to false when checked loosely (e.g. via !=).
  2. You do not fetch the id column in your query. Your code cannot work as you wrote it.
  3. You define no order for the rows in the result. It is almost always desirable to have an explicit order.
  4. The LIMIT clause leads to processing only a limited number of rows. If I get it correctly, you want to process all the URLs.
  5. You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ @ PHP.net. I did not fix this one.
  6. As already said multiple times, copy() does not throw; it returns a success indicator.
  7. Variable expansion was clumsy. This one is a purely cosmetic change, though.
  8. To be sure the generated output gets to the user ASAP, use flush(). When using output buffering (ob_start etc.), it needs to be handled too.

With fixes applied, the code now looks like this:

$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    $fn = $info['filename'];
    if (copy(
        'http:' . $row->url,
        "img/{$fn}_{$row->id}.{$info['extension']}"
    )) {
        echo "success: $fn\n";
    } else {
        echo "fail: $fn\n";
    }
    flush();
}

Issue #2 is solved by this: you will see which files were and were not copied. If the process (and its output) stops too early, you know the id of the last processed row and you can query your DB for the higher ones (not processed). Another approach is adding a boolean column copied to tablName and updating it immediately after successfully copying the file. Then you may want to change the query in the code above to skip rows that already have copied = 1, as sketched below.
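A sketch of that second approach, still using the question's deprecated mysql_* API and assuming the copied column has been added with a default of 0:

// process only rows that have not been copied yet
$queryRes = mysql_query("SELECT id, url FROM tablName WHERE copied = 0 ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    if (copy('http:' . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}")) {
        // mark the row as done so the next run skips it
        mysql_query('UPDATE tablName SET copied = 1 WHERE id = ' . (int) $row->id);
    }
    flush();
}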

Issue #1 is addressed in "Long computation in PHP results in 503 error" here on SO and "503 service unavailable when debugging PHP script in Zend Studio" on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best option to me. Is there any need to launch this huge batch from the browser? It will run for a very long time.

Palec
-1

503 is a fairly generic error, which in this case probably means something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.

You need to identify which component is timing out. If it's PHP, you can use set_time_limit.

Another option might be to break the work up so that you only process one file per request, then redirect back to the same script to continue processing the rest. You would have to somehow maintain a list of which files have been processed between calls. Or process in order of database id, and pass the last used id to the script when you redirect.
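A rough sketch of that last idea; the script name download.php and the query details are assumptions, not something from the question:

// sketch only: process one row per request, then redirect with the last processed id
$lastId = isset($_GET['last_id']) ? (int) $_GET['last_id'] : 0;
$res = mysql_query("SELECT id, url FROM tablName WHERE id > {$lastId} ORDER BY id LIMIT 1");
if ($row = mysql_fetch_object($res)) {
    $info = pathinfo($row->url);
    copy('http:' . $row->url, "img/{$info['filename']}_{$row->id}.{$info['extension']}");
    // hand off to a fresh request for the next row
    header('Location: download.php?last_id=' . $row->id);
    exit;
}
echo 'done';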

rich remer