2

I'm currently using ftplib in Python to get some files and write them to S3.

The approach I'm using is to open a local file and have retrbinary write into it, as shown below:

    with open('file-name', 'wb') as fp:
        ftp.retrbinary('RETR file-name', fp.write)

to download files from the FTP server and save them in a temporary folder, then upload them to S3.
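For completeness, the full flow looks roughly like this (I'm using boto3 for the S3 upload; the host, bucket, and file names below are just placeholders):

    import os
    import tempfile
    from ftplib import FTP

    import boto3

    ftp = FTP('ftp.example.com')  # placeholder host
    ftp.login()
    s3 = boto3.client('s3')

    filename = 'file-name'
    local_path = os.path.join(tempfile.gettempdir(), filename)

    # Download into the temp folder; retrbinary expects the full 'RETR <name>' command.
    with open(local_path, 'wb') as fp:
        ftp.retrbinary('RETR ' + filename, fp.write)

    # Upload to S3, then clean up the temp copy.
    s3.upload_file(local_path, 'my-bucket', filename)  # placeholder bucket
    os.remove(local_path)
    ftp.quit()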

I wonder if this is best practice, because the shortcoming of this approach is:

If the files are numerous and big, I can download them, upload them to S3, and then delete them from the temp folder. But since I run this script once a day, I would have to download everything again each time. How can I check whether a file has already been downloaded and exists in S3, so that the script only processes files newly added to the FTP server?

Hope this makes sense. It would be great if anyone has an example or something similar. Many thanks.

wawawa
  • I do not think your question title summarizes your problem. You do not have a problem with *"getting files from FTP and writing them to AWS S3"*; you seem to have that resolved. Your problem is finding which files are new on the FTP server. – Martin Prikryl Feb 24 '21 at 10:51
  • Anyway, this may help you: [How to get FTP file's modify time using Python ftplib](https://stackoverflow.com/q/29026709/850848). – Martin Prikryl Feb 24 '21 at 10:53

1 Answer

2

You cache the fact that you processed a given file path in persistent storage (say, a SQLite database). If a file may change after you have processed it, you may be able to detect that by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, cache a checksum (say, SHA-256) of the file; you then have to download the file again to recalculate the checksum and see whether it changed. S3 might support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally along with an 'Expect: 100-continue' header, so the server can tell you it already has the file before you send the data.
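Here is a rough sketch of the first idea (a SQLite cache keyed on the file path, with the size and modify time from MLSD used to detect changes), assuming boto3 for the S3 upload and an FTP server that supports MLSD; the host, bucket, and database names are placeholders:

    import os
    import sqlite3
    import tempfile
    from ftplib import FTP

    import boto3

    FTP_HOST = 'ftp.example.com'     # placeholder
    BUCKET = 'my-bucket'             # placeholder
    CACHE_DB = 'processed_files.db'  # persistent record of what was uploaded

    db = sqlite3.connect(CACHE_DB)
    db.execute('CREATE TABLE IF NOT EXISTS processed'
               ' (path TEXT PRIMARY KEY, size INTEGER, mtime TEXT)')

    s3 = boto3.client('s3')
    ftp = FTP(FTP_HOST)
    ftp.login()  # pass user/password if the server requires them

    # MLSD yields (name, facts) pairs; 'size' and 'modify' are included
    # when the server provides those facts.
    for name, facts in ftp.mlsd():
        if facts.get('type') != 'file':
            continue
        size = int(facts.get('size', -1))
        mtime = facts.get('modify', '')

        row = db.execute('SELECT size, mtime FROM processed WHERE path = ?',
                         (name,)).fetchone()
        if row == (size, mtime):
            continue  # already uploaded and unchanged, skip it

        # New or changed: download to a temp file, upload, then cache it.
        with tempfile.NamedTemporaryFile(delete=False) as fp:
            ftp.retrbinary('RETR ' + name, fp.write)
            tmp_path = fp.name
        try:
            s3.upload_file(tmp_path, BUCKET, name)
        finally:
            os.remove(tmp_path)

        db.execute('INSERT OR REPLACE INTO processed (path, size, mtime)'
                   ' VALUES (?, ?, ?)', (name, size, mtime))
        db.commit()

    ftp.quit()

If the server does not support MLSD, you can fall back to FTP.size() and parsing FTP.dir() output, or to the checksum approach described above.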

Allan Wind
  • Thanks, is there any example code for this? – wawawa Feb 24 '21 at 10:40
  • There are open-source FTP-to-S3 sync options like https://github.com/vangheem/sync-ftp-to-s3/blob/master/sync-ftp-to-s3.py. I don't know whether they implement the logic you are looking for. – Allan Wind Feb 24 '21 at 10:43