
I have some Python code that downloads files from an FTP server and writes them to an AWS S3 bucket. I want to make sure no files are missed, so my current approach is:

1. List all the available files on the FTP server and add the filenames to `list_1`.
2. List all the files in the S3 bucket and add the filenames to `list_2`.
3. Compare `list_1` and `list_2` to identify missing files that haven't been downloaded to S3 yet.
4. Download those missing files.
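The four steps above can be sketched with `ftplib` and `boto3`; the host, credentials, and bucket name are placeholders, and the comparison itself is just a set difference, factored out here as a small pure function:

```python
import io
from ftplib import FTP


def missing_files(ftp_names, s3_names):
    """Step 3: FTP filenames not yet present in S3, as a sorted list."""
    return sorted(set(ftp_names) - set(s3_names))


def sync_once(ftp_host, ftp_user, ftp_password, bucket):
    """One hourly pass: list FTP, list S3, transfer the difference."""
    import boto3  # assumed dependency: pip install boto3

    s3 = boto3.client("s3")

    with FTP(ftp_host) as ftp:
        ftp.login(ftp_user, ftp_password)
        ftp_names = ftp.nlst()  # step 1: FTP listing (the slow step)

        # Step 2: S3 listing, paginated because buckets can hold
        # more than the 1000 keys a single response returns.
        paginator = s3.get_paginator("list_objects_v2")
        s3_names = [
            obj["Key"]
            for page in paginator.paginate(Bucket=bucket)
            for obj in page.get("Contents", [])
        ]

        # Step 4: fetch each missing file and upload it to S3.
        for name in missing_files(ftp_names, s3_names):
            buf = io.BytesIO()
            ftp.retrbinary(f"RETR {name}", buf.write)
            buf.seek(0)
            s3.upload_fileobj(buf, bucket, name)
```

This buffers each file in memory; for large files you would stream to a temporary file instead.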

The issue is that this code needs to run every hour, and there are quite a lot of files on the FTP server, so the first step (listing the filenames on the FTP server) takes a long time (I have a separate question about that: link). Does anyone have a better idea to improve this logic and make it faster to execute?

wawawa

1 Answer

See How to check for change in the directory at FTP server? In short, there's hardly a better way than what you are doing currently.

The only possible optimization is if your server supports the -t switch to the LIST command, which you can use to obtain the file list sorted by modification time. See How to get files in FTP folder sorted by modification time (a PHP question, but relevant nevertheless).

You can then abort the listing the moment you encounter the first file that is already in S3. Of course, that works only if you want to upload new files only, not any file (even an old one) added to the FTP server.
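A sketch of this idea: fetch the listing newest-first via `LIST -t` and stop processing at the first name already present in S3. Note that the filename parsing below is a guess that works for common Unix-style `LIST` output (the format is server-specific), and `retrlines` here still transfers the whole listing; the saving comes from skipping the S3 comparison and downloads for everything past the break.

```python
from ftplib import FTP


def new_files(names_newest_first, s3_keys):
    """Names appearing before the first already-uploaded file in a
    newest-first listing; everything after it is assumed to be in S3
    already (i.e., only genuinely new files are wanted)."""
    uploaded = set(s3_keys)
    result = []
    for name in names_newest_first:
        if name in uploaded:
            break  # older files are assumed uploaded too
        result.append(name)
    return result


def list_newest_first(ftp):
    """Fetch 'LIST -t' output (newest first, if the server supports -t)
    and keep the last column of each line as the filename."""
    lines = []
    ftp.retrlines("LIST -t", lines.append)
    return [line.split(maxsplit=8)[-1] for line in lines]
```

`new_files(list_newest_first(ftp), s3_keys)` then yields only the files that still need uploading.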

Martin Prikryl