
I want to build a script which finds out which files on an FTP server are new and which have already been processed.
For each file on the FTP server we read out the information, parse it, and write the parts we need to our database. The files are XML files, but they have to be translated.

At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory - and it will be more every day.

Instead of comparing this list with an older list that I saved in a text file, I would like to know if there are better possibilities.
Because this task has to run "live", it would end up as a cron job every 1 or 2 minutes. If the method takes too long, this won't work.

The solution should be either in PHP or Python.

from ftplib import FTP_TLS

def handle(self, *args, **options):
    # host, user and passwd are defined elsewhere (e.g. in the settings)
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    # mlsd() yields a (name, facts) tuple for every entry in the directory
    entries = ftp.mlsd("...")
    for name, facts in entries:
        print(name + " => " + facts['modify'])

This code example alone already runs for 4 minutes.
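For comparison, the diff-against-a-saved-list approach mentioned above could look roughly like this; the state-file name and credentials are placeholders, and the directory path is the same elided one as in the code above:

import os
from ftplib import FTP_TLS

host, user, passwd = "ftp.example.com", "user", "secret"  # placeholder credentials
SEEN_FILE = "processed_files.txt"                         # hypothetical local state file

# load the names we already processed on a previous run
if os.path.exists(SEEN_FILE):
    with open(SEEN_FILE) as f:
        seen = set(line.strip() for line in f)
else:
    seen = set()

ftp = FTP_TLS(host=host)
ftp.login(user, passwd)
ftp.prot_p()

new_files = []
for name, facts in ftp.mlsd("..."):          # still has to list the whole directory
    if facts.get("type") == "file" and name not in seen:
        new_files.append(name)

# ... download/parse new_files here, then remember them for the next run
seen.update(new_files)
with open(SEEN_FILE, "w") as f:
    f.write("\n".join(sorted(seen)))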

  • Instead of comparing a list with another list, I would suggest saving the timestamp of the last query and looking for files that were created since that timestamp. – J. Ghyllebert Jan 03 '19 at 06:33
  • @J.Ghyllebert For that you still have to use `mlsd`, so I do not think it solves OP's problem. – Martin Prikryl Jan 03 '19 at 06:46
  • That's what I was thinking about before. My newest idea is to run mlsd -> make a list -> compress the already parsed files into a backup.zip and delete the single files. The next time mlsd runs it will exclude the zip, and the runtime should be better? – Rune Jan 03 '19 at 06:49
  • @Rune instead of creating a zip every time, you could as well move processed files to another dir. – J. Ghyllebert Jan 03 '19 at 07:00
  • Moving is a good idea. Zipping is not, as you cannot zip files on the FTP server. You would have to download them, delete the remote copies, zip them locally and upload the zip back (which makes little sense to me). -- Though none of these is really an answer to *"Find out differences between dirlist on time A and time B on ftp"* -- If you are looking for solutions like these, you should really change your question title. – Martin Prikryl Jan 03 '19 at 07:13
  • The question is still the same; my comment was just an idea for achieving a solution to the problem mentioned in the question. Perhaps the question could be changed to "how to find out which files on the FTP are new". – Rune Jan 03 '19 at 07:18

2 Answers


If FTP is your only interface to the server, there's no better way than what you are already doing.

Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.

And it only helps if what takes long is the download of the file list (not the initiation of the listing). In that case you can request the sorted list, but read only the leading new files, aborting the listing once you find the first already processed file.

For an example of how to abort the download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines

Something like this:

class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s): # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip the rest of the aborted listing response
    ftp.getmultiline()
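
isProcessedFile is left open above; a minimal placeholder, assuming the names that were already handled are loaded into a set (for example from the database mentioned in the question), might be:

# hypothetical: fill this set from wherever you track processed files,
# e.g. from the database the question writes to
processed = set()

def isProcessedFile(s):
    # NLST may return full paths, so compare only the file name itself
    return s.rsplit("/", 1)[-1] in processed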
Martin Prikryl
  • I will try whether my FTP supports this method. Nice and clean idea. Perhaps I could improve it by saving a text file to the FTP with the timestamp of the last retrieved item? Then I only have to compare the modification times of the sorted list with the one in the text file and abort if it's older. – Rune Jan 03 '19 at 08:07
  • Hi, I've chosen the easiest way, even though I found yours nicer. I'm just copying already parsed files to a subdirectory now. – Rune Jan 09 '19 at 10:47

I have always tried to avoid browsing a folder to find out what could have changed. I prefer to set up a dedicated workflow instead. When files can only be added (or new versions of existing files can arrive), I try to use a workflow where files are added to one directory and then move on to other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or when they are copied/moved from one folder to another.
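
A rough sketch of such a workflow over FTP, with made-up directory names and placeholder credentials, would move every file out of the incoming directory once it has been handled:

from ftplib import FTP_TLS

ftp = FTP_TLS(host="ftp.example.com")   # placeholder host and credentials
ftp.login("user", "secret")
ftp.prot_p()

INCOMING = "/incoming"                  # hypothetical directory layout
ARCHIVE = "/processed"

for name, facts in ftp.mlsd(INCOMING):
    if facts.get("type") != "file":
        continue
    # ... download and parse INCOMING + "/" + name here ...
    # then move it to the archive so the next run only sees new files
    ftp.rename(INCOMING + "/" + name, ARCHIVE + "/" + name)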

As a nice extra, I also use a copy/rename pattern: the files are first copied using a temporary name (for example a .t prefix or suffix) and renamed once the copy has finished. This prevents anyone from trying to process a file which is not yet fully copied. It used to be more important when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less.
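
From the uploader's side, the copy/rename pattern could look like this (again with placeholder names); the consumer simply ignores anything still carrying the temporary suffix:

from ftplib import FTP_TLS

ftp = FTP_TLS(host="ftp.example.com")   # placeholder host and credentials
ftp.login("user", "secret")
ftp.prot_p()

local_name = "report.xml"               # hypothetical file to publish
with open(local_name, "rb") as f:
    # upload under a temporary name so nobody processes a half-written file
    ftp.storbinary("STOR " + local_name + ".t", f)
# rename only after the upload has completed
ftp.rename(local_name + ".t", local_name)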

I'm unsure whether this is really relevant here, because it could require some refactoring, but it gives bulletproof solutions.

Serge Ballesta