
I have FTP access to a single directory that holds all images for all of the vendor's products. Each product has multiple images: variations in size and variations in how the product is displayed.

There is no "list" (XML, CSV, database, ...) from which I could tell what's new. For now the only way I see is to grab all filenames and compare them with the ones in my DB.

The last check counted 998.283 files in that directory. One product has multiple variations, and there is no documentation of how they are named.

I did an initial grab of the filenames, compared them with my products, and saved them in a database table for "images" with their filenames and date modified (taken from the file).

The next step is to check for "new ones".

What I am doing now is:

// get the file list
foreach ($this->getFilenamesFromFtp() as $key => $image_data) {
  // extract data from the filename (product code, size, variation number, extension, ...)
  // so it can be stored in the table and used later as a reference
  // (e.g. I only want the large image of a variation, not all sizes)
  $data = self::extractDataFromImage($image_data);
  // if the filename already exists in the DB images table, do nothing;
  // otherwise continue with the insertion into the DB
  if (!$this->checkForFilenameInDb($data['filename'])) {
    $export_codes = $this->export->getProductIds();
    // check if the product code is in the export table - that is, do we really need this image
    if ($this->functions->in_array_r($data['product_code'], $export_codes)) {
      self::insertImageDataInDb($data);
    } // end if product code in export
  } // end if filename not already in DB
} // end foreach

and my method getFilenamesFromFtp() looks like this:

$filenames = array();
$i = 1;
$ftp = $this->getFtpConfiguration();

// set up basic connection
$conn_id = ftp_ssl_connect($ftp['host']);

// login with username and password
$login_result = ftp_login($conn_id, $ftp['username'], $ftp['pass']);

ftp_set_option($conn_id, FTP_USEPASVADDRESS, false);
$mode = ftp_pasv($conn_id, true);
ftp_set_option($conn_id, FTP_TIMEOUT_SEC, 180);

// login OK?
if ((!$conn_id) || (!$login_result) || (!$mode)) {
  die("FTP connection has failed!");
}
else {
  // get all filenames and store them in an array
  $files = ftp_nlist($conn_id, ".");
  // count the files in the array = the number of files on the FTP server
  $nofiles = count($files);
  foreach ($files as $filename) {
    // the LIMIT was added for developing/testing; in production (current mode) it runs without a limit
    if (self::LIMIT > 0 && $i == self::LIMIT) {
      break;
    }
    else {
      // get the date modified from the file
      $date_modified = ftp_mdtm($conn_id, $filename);

      // build an array of filenames and dates modified so it can be returned and stored in the DB
      $filenames[] = array(
        "filename" => $filename,
        "date_modified" => $date_modified
      );
    } // end if LIMIT
    $i++;
  } // end foreach
  // close the connection
  ftp_close($conn_id);
  return $filenames;
}

The problem is that the script takes a very long time. The slowest part I have identified so far is in getFilenamesFromFtp(), where I build the array:

      $filenames[]= array(
         "filename" => $filename,
         "date_modified" => $date_modified
      );

That part alone has been running for 4 hours so far and is still not done.

While writing this I had an idea: remove the "date modified" lookup from this first pass and fetch it later, only when I am actually planning to store that image in the DB.
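Roughly what I have in mind (just a sketch, not tested yet; I would keep the FTP connection in a class property such as $this->conn_id so ftp_mdtm() can still be called in the second step, and the helper signatures may need small adjustments):

// first pass: only collect the filenames, no ftp_mdtm() calls here
$files = ftp_nlist($this->conn_id, ".");

foreach ($files as $filename) {
  $data = self::extractDataFromImage(array('filename' => $filename));

  // skip files we already know about
  if ($this->checkForFilenameInDb($data['filename'])) {
    continue;
  }

  // only for products that are actually exported...
  if ($this->functions->in_array_r($data['product_code'], $this->export->getProductIds())) {
    // ...do the extra FTP round trip for the modification time
    $data['date_modified'] = ftp_mdtm($this->conn_id, $filename);
    self::insertImageDataInDb($data);
  }
}

ftp_close($this->conn_id);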

I will update this question as soon as I am done with this change and test :)

Oktarin

2 Answers


Processing a million filenames will take time. However, I see no reason to store those filenames (and date_modified) in an array; why not process each filename directly?

Also, instead of completely processing a filename, why not store it in a database table first? Then you can do the real processing later. By splitting the task in two, retrieval and processing, it becomes more flexible. For instance, you don't need to do a new retrieval if you want to change the processing.
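As a rough sketch of the retrieval stage (the table and column names are just examples, I'm assuming MySQL with a unique index on filename so duplicates are simply ignored, and $conn_id is the FTP connection from your question):

// stage 1: retrieval - stream the filenames straight into a staging table
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$stmt = $pdo->prepare('INSERT IGNORE INTO ftp_filenames (filename) VALUES (?)');

foreach (ftp_nlist($conn_id, ".") as $filename) {
  $stmt->execute(array($filename)); // nothing is accumulated in memory between iterations
}

// stage 2: processing - done later, purely on the database content, e.g.
// SELECT filename FROM ftp_filenames WHERE processed = 0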

KIKO Software
  • Thank you! So it is faster to store the filename directly in the DB and later query to check if the filename exists, than to store it in an array? – Oktarin Apr 25 '21 at 15:05
  • Not necessarily faster, but more practical. An array takes up memory, and an array of a million files takes up a lot of memory. What if you've read 900.000 files into the array and you run out of memory? You'll have to start all over. That takes time, and is therefore slow. Simply writing them to the database doesn't take up much memory, and there is less chance of failure. Then, in the second stage, you can do the processing purely on the database content. – KIKO Software Apr 25 '21 at 15:56

If the objective is to just display new files on the webpage:

  • You can just store the highest file created/modified time in the DB.
  • This way, for the next batch, just fetch that last modified time and compare it against the created/modified time of each file. This keeps your app pretty lightweight. You can use filemtime for this.
  • Then take the highest filemtime of all files in the current iteration, store that highest value in the DB, and repeat the same steps on the next run (see the sketch after this list).
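A rough sketch of that idea over an FTP connection (getLastRunTimestamp(), saveLastRunTimestamp() and processNewFile() are hypothetical helpers, and ftp_mdtm() stands in for filemtime() since the files are remote):

$lastRun = $this->getLastRunTimestamp(); // highest timestamp stored in the DB so far
$maxSeen = $lastRun;

foreach (ftp_nlist($conn_id, ".") as $filename) {
  $mtime = ftp_mdtm($conn_id, $filename);
  if ($mtime === -1 || $mtime <= $lastRun) {
    continue; // unchanged since the last batch
  }
  // the file is new (or modified) since the previous run
  $this->processNewFile($filename, $mtime);
  $maxSeen = max($maxSeen, $mtime);
}

$this->saveLastRunTimestamp($maxSeen); // remember the highest time seen for the next batch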

Suggestions:

foreach ($this->getFilenamesFromFtp() as $key => $image_data) {

If the above snippet gets all filenames in an array, you can discard this strategy; it would consume a lot of memory. Instead, read files one by one using directory functions as mentioned in this answer, since they maintain an internal pointer for the handle and don't load all files at once. Of course, you would need to make the pointed out answer follow recursive iteration as well for nested directories.
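As a minimal sketch of the one-by-one approach (the path, credentials, and processFilename() are placeholders; for a remote server the path would have to be a stream-wrapper URL such as ftp:// or ftps://, assuming the FTP wrapper is enabled on your setup and the server permits it):

// read filenames one at a time; only one name is held in memory per iteration
$dir = 'ftps://username:password@ftp.example.com/images/'; // placeholder path
$handle = opendir($dir);
if ($handle === false) {
  die('Could not open the directory');
}

while (($filename = readdir($handle)) !== false) {
  if ($filename === '.' || $filename === '..') {
    continue; // skip the dot entries
  }
  // process one filename at a time, e.g. compare its date against the stored maximum
  // processFilename($filename);
}

closedir($handle);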

nice_dev
  • Thank you, very interesting. But in order to get the last modified time I still have to iterate through all files in the directory on the remote FTP server. I have to check the date of each file in order to find the "new" files. – Oktarin Apr 25 '21 at 16:34
  • Yes, of course, but the overhead won't be much. – nice_dev Apr 25 '21 at 16:44
  • So I should do something like this: get last_modified_date from the DB, then for each file from FTP check if the file date is newer than the one from the DB -> if yes, continue... – Oktarin Apr 25 '21 at 17:04
  • Yes, that's what I meant. In this scenario any modified file will also have a new time, but technically it is also "new". The second part of my answer is also important to make things even faster. – nice_dev Apr 25 '21 at 17:09
  • While storing the datetime, make sure it's the max of all processed files' datetimes. – nice_dev Apr 25 '21 at 17:11
  • I looked at the directory functions, but as far as I can see I cannot use them with an FTP connection. That is why I am using `ftp_nlist`... or is there a way to combine directory functions with FTP? – Oktarin May 01 '21 at 08:59