2

I'm currently using ftplib in Python to get some files and write them to S3.

The approach I'm using is to open a local file and have retrbinary write into it, as shown below:

    with open('file-name', 'wb') as fp:
        ftp.retrbinary('RETR file-name', fp.write)

to download files from the FTP server and save them in a temporary folder, then upload them to S3.
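For completeness, the full flow looks roughly like this (I'm using boto3 for the S3 upload; the host, bucket, and file names below are just placeholders):

    import os
    import tempfile
    from ftplib import FTP

    import boto3

    ftp = FTP('ftp.example.com')  # placeholder host
    ftp.login()
    s3 = boto3.client('s3')

    filename = 'file-name'
    local_path = os.path.join(tempfile.gettempdir(), filename)

    # Download into the temp folder; retrbinary expects the full 'RETR <name>' command.
    with open(local_path, 'wb') as fp:
        ftp.retrbinary('RETR ' + filename, fp.write)

    # Upload to S3, then clean up the temp copy.
    s3.upload_file(local_path, 'my-bucket', filename)  # placeholder bucket
    os.remove(local_path)
    ftp.quit()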

I wonder if this is best practice, because the shortcoming of this approach is:

If the files are numerous and big, I can download them, upload them to S3, and then delete them from the temp folder. But since I run this script once a day, I would have to download everything again each time. How can I check whether a file has already been downloaded and exists in S3, so that the script only processes files newly added to the FTP server?

Hope this makes sense. It would be great if anyone has an example or something similar. Many thanks.

wawawa
  • I do not think your question title summarizes your problem. You do not have a problem with *"getting files from FTP and writing them to AWS S3"*; you seem to have that resolved. Your problem is finding which files are new on the FTP server. – Martin Prikryl Feb 24 '21 at 10:51
  • Anyway, this may help you: [How to get FTP file's modify time using Python ftplib](https://stackoverflow.com/q/29026709/850848). – Martin Prikryl Feb 24 '21 at 10:53

1 Answer

2

You cache the fact that you processed a given file path in persistent storage (say, a SQLite database). If a file may change after you have processed it, you may be able to detect that by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, cache a checksum (say, SHA-256) of the file; you then have to download the file again to recalculate the checksum and see whether it changed. S3 might support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally along with an 'Expect: 100-continue' header, so the server can tell you it already has the file before you send the data.
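Here is a rough sketch of the first idea (a SQLite cache keyed on the file path, with the size and modify time from MLSD used to detect changes), assuming boto3 for the S3 upload and an FTP server that supports MLSD; the host, bucket, and database names are placeholders:

    import os
    import sqlite3
    import tempfile
    from ftplib import FTP

    import boto3

    FTP_HOST = 'ftp.example.com'     # placeholder
    BUCKET = 'my-bucket'             # placeholder
    CACHE_DB = 'processed_files.db'  # persistent record of what was uploaded

    db = sqlite3.connect(CACHE_DB)
    db.execute('CREATE TABLE IF NOT EXISTS processed'
               ' (path TEXT PRIMARY KEY, size INTEGER, mtime TEXT)')

    s3 = boto3.client('s3')
    ftp = FTP(FTP_HOST)
    ftp.login()  # pass user/password if the server requires them

    # MLSD yields (name, facts) pairs; 'size' and 'modify' are included
    # when the server provides those facts.
    for name, facts in ftp.mlsd():
        if facts.get('type') != 'file':
            continue
        size = int(facts.get('size', -1))
        mtime = facts.get('modify', '')

        row = db.execute('SELECT size, mtime FROM processed WHERE path = ?',
                         (name,)).fetchone()
        if row == (size, mtime):
            continue  # already uploaded and unchanged, skip it

        # New or changed: download to a temp file, upload, then cache it.
        with tempfile.NamedTemporaryFile(delete=False) as fp:
            ftp.retrbinary('RETR ' + name, fp.write)
            tmp_path = fp.name
        try:
            s3.upload_file(tmp_path, BUCKET, name)
        finally:
            os.remove(tmp_path)

        db.execute('INSERT OR REPLACE INTO processed (path, size, mtime)'
                   ' VALUES (?, ?, ?)', (name, size, mtime))
        db.commit()

    ftp.quit()

If the server does not support MLSD, you can fall back to FTP.size() and parsing FTP.dir() output, or to the checksum approach described above.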

Allan Wind
  • Thanks, is there any example code for this? – wawawa Feb 24 '21 at 10:40
  • There are open-source FTP-to-S3 sync options like https://github.com/vangheem/sync-ftp-to-s3/blob/master/sync-ftp-to-s3.py. I don't know whether they implement the logic you are looking for. – Allan Wind Feb 24 '21 at 10:43