
I need to automate pulling (downloading) files from a wide variety of FTP services, spread across different domains, that receive files on a 24/7 basis.

My problem is that FTP services generally allow a file to be downloaded while it is still being uploaded. This is one of the references to the problem that can be found on the internet.

This can lead to incomplete file downloads.

I tried to replicate the situation using a Windows server and the FileZilla FTP client, and got half of the file as expected, so no safety mechanism was in place to prevent this. So maybe there is simply no way to prevent it from the client side.

So my question is whether there is some anchor, something my client can test, to know for sure that the FTP server already has the complete file.

I find it hard to believe that a protocol as old as FTP doesn't provide a safe mechanism, so I must be missing something, or this is by design.

Update: I am developing the automation in C#, but any technical tip can help. The solution needs to be foolproof because it is critical for the business.

Update 2: The uploads are made by many different clients, so it is impossible to establish a convention with all of them.

Update 3: This question is similar to How to detect that a file is being uploaded over FTP, but has the additional restriction presented in update 2.

MiguelSlv
    What kind of mechanism would you expect ftp to impose here? It's opening a file on disk and reading from it, if the file system isn't preventing it from doing this, how would the ftp server know that the file is actually busy? And no, you can't really fix this client-side, this has to be done server-side. In short, configure the server software in such a way that the files you download aren't the same as those that are being uploaded. – Lasse V. Karlsen Feb 09 '17 at 14:26
    I think this is not programming related, see [this answer](http://stackoverflow.com/a/29249203/579895) – Pikoh Feb 09 '17 at 14:26
  • @LasseV.Karlsen, as a service, the FTP server controls both sides. Also, it should know when the file is complete. – MiguelSlv Feb 09 '17 at 14:28

4 Answers


I created the following automated solution, based on input from the answers in this post and elsewhere, to address my problem as it is, meaning: pull files from different FTP servers, from different vendors, in a scenario where concurrency is very likely to happen.

Using signal files or the other mechanisms suggested in this post would require forcing clients to change the way they interact with us, so it is a solution for most cases but not for my particular problem.

So, my solution was:

  1. Scan the folder, parsing the filename, date, and size of each file.
  2. Discard any file that is too new. A file is only considered for download if its date is older than a few minutes. A hung upload may cause this rule to fail at preventing concurrency.
  3. Rename the file. If this fails, jump out. This rename-based claim has proven to be 100% accurate so far.
  4. Download the renamed file.
  5. Check the size of the transfer and see if it matches the size attribute (paranoia check).
  6. Delete the successfully transferred file from the FTP server.

This solution allows us to poll FTP folders intensively.
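The steps above can be sketched as follows. This is a minimal sketch in Python's standard `ftplib` for brevity (the same commands, MLSD, RNFR/RNTO, RETR, and DELE, are available from C# FTP libraries); the `.claimed` suffix and the five-minute age threshold are illustrative assumptions, not part of the original solution:

```python
import os
from datetime import datetime, timedelta
from ftplib import FTP, error_perm

MIN_AGE = timedelta(minutes=5)  # assumed threshold; tune to your uploads


def is_old_enough(modified: datetime, now: datetime,
                  min_age: timedelta = MIN_AGE) -> bool:
    """Step 2: only consider files whose timestamp is at least min_age old."""
    return now - modified >= min_age


def claim_name(name: str) -> str:
    """Step 3: the name a file is renamed to in order to claim it."""
    return name + ".claimed"


def pull_completed_files(ftp: FTP, local_dir: str) -> None:
    """Steps 1-6 against a connected ftplib.FTP instance (untested sketch)."""
    now = datetime.utcnow()
    for name, facts in ftp.mlsd():              # step 1: name, date, size
        if facts.get("type") != "file" or name.endswith(".claimed"):
            continue
        modified = datetime.strptime(facts["modify"], "%Y%m%d%H%M%S")
        if not is_old_enough(modified, now):
            continue                            # step 2: too new, skip
        try:
            ftp.rename(name, claim_name(name))  # step 3: claim via rename
        except error_perm:
            continue                            # rename failed: jump out
        local_path = os.path.join(local_dir, name)
        with open(local_path, "wb") as f:       # step 4: download
            ftp.retrbinary("RETR " + claim_name(name), f.write)
        if os.path.getsize(local_path) != int(facts["size"]):
            continue                            # step 5: paranoia check failed
        ftp.delete(claim_name(name))            # step 6: remove from server
```

The rename in step 3 is what makes this safe under concurrency: on typical servers RNFR/RNTO is atomic, so only one poller can win the claim.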

MiguelSlv

I believe that from the client side, there's not much you can do.

At most, you could re-check the file size after some time, see whether it has changed, and take whatever steps are required to get the new content.
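That re-check could be wrapped in a small helper. A sketch in Python (with `ftplib`, the callable would be something like `lambda: ftp.size(name)`; the function name and stability count are assumptions):

```python
import time
from typing import Callable


def wait_until_size_stable(get_size: Callable[[], int],
                           checks: int = 3,
                           interval_seconds: float = 60.0) -> int:
    """Poll a size-reporting callable until it returns the same value
    `checks` times in a row, then return that size.  A heuristic only:
    a stalled upload also looks "stable", so this cannot guarantee the
    file is complete."""
    last = get_size()
    stable = 1
    while stable < checks:
        time.sleep(interval_seconds)
        size = get_size()
        stable = stable + 1 if size == last else 1
        last = size
    return last
```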

Mihai Caracostea
  • I thought of renaming the file and then renaming it back, but that could cause other problems. – MiguelSlv Feb 09 '17 at 14:33
  • @ByteArtisan The only foolproof way would be if you could control the upload process. That way you could start the upload with a temporary name or extension (e.g. myfile.txt.temp) and only after the upload succeeded would you rename it to the proper file name (myfile.txt). The reader would then know to ignore files like *.temp. However, this means having a convention between uploaders and downloaders, which I understand you don't have the possibility to enforce. – Mihai Caracostea Feb 09 '17 at 14:38
  • Unfortunately, that is not the case. – MiguelSlv Feb 09 '17 at 14:40
  • @ByteArtisan To alleviate the issue, you could also download files only if they have a timestamp older than, say, 30 minutes. This way you'd reduce some of the conflicts, depending on the time interval you choose. However, you might run into clock sync issues. – Mihai Caracostea Feb 09 '17 at 14:42
  • That would work, but it adds 30 minutes to the processing of each file. I'm not sure it would be 100% foolproof, unless the server closes the connection if the uploader hangs for less than that time. – MiguelSlv Feb 09 '17 at 14:51
  • @ByteArtisan As I said... it would only alleviate the issue (not fix it) more or less, depending on the interval you choose. You need to balance the urgency of the download against the fact that, as time passes, the risk of conflict decreases. – Mihai Caracostea Feb 09 '17 at 14:54

FTP was not designed as a protocol for this kind of real-time exchange of data between two clients via an FTP server. There is no notification to a client that a file intended for download is still being uploaded, nor is there any indication, when overwriting a file, that somebody is currently downloading it. This is not a design error in the FTP protocol; the real problem is that you are trying to use the protocol for a purpose it was not designed for.

Steffen Ullrich
  • Simple case: entity A puts files on the FTP server, entity B gets the files. No schedule is agreed between A and B. Files are big and take an hour to upload. How should B handle the download? – MiguelSlv Feb 09 '17 at 15:01
  • @ByteArtisan: again, FTP was not designed for this use case. Thus don't complain that it does not provide what you want but use a protocol which provides what you need. You would need a protocol which supports locking of data, like WebDAV or you would need to add your own fragile locking logic on top of FTP using helper files. – Steffen Ullrich Feb 09 '17 at 16:45
  • By now I think you are right about that. I don't think my use case, or the simple one, is uncommon, which is why I am reluctant about it. Anyway, I can't force clients to change to other protocols, so I will have to go for a trick. – MiguelSlv Feb 09 '17 at 17:14

So you have this scenario:

[Publisher] --uploads file--> [FTP Server] --downloads file--> [You]

You have a publisher who is uploading files to an FTP server, and you download from the same FTP server. There can also be different FTP Server instances, one for upload and one for download, looking at the same directory, but that doesn't change much.

Now because you're looking at the same directory, you, the downloader, see files as soon as the filesystem entry is created - when the first bytes from the publisher may even still be in flight.

There are basically three solutions for this:

  • Sentinel files, written by the FTP server or a plugin. Either a "$originalFileName.lock" that exists while the file is being uploaded, or a "$originalFileName.done" that is written when the upload successfully completes.
  • Moving files to different directories: the FTP server moves the files from the upload directory where the publisher writes to the download directory from which you read.
  • The least stable: check file size and time. When you start a download, remember the timestamp and size of the file that the FTP server reports. When you're done downloading the file, compare your values against the remembered ones. When they don't match, resume the download from where you stopped to obtain the remaining bytes, ad infinitum. You can for example decide "a file is successfully uploaded if it hasn't grown in size for five minutes", but that's not very robust - and can cause you to wait five minutes for nothing.
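For the sentinel-file option, the client-side check reduces to selecting files whose sentinel is present in the listing. A sketch in Python (the `.done` suffix and the function name are assumptions; match the suffix to whatever the server side actually writes):

```python
from typing import Iterable, List


def ready_files(listing: Iterable[str],
                sentinel_suffix: str = ".done") -> List[str]:
    """Return the data files from a directory listing whose matching
    '<name><sentinel_suffix>' sentinel file is also present."""
    names = set(listing)
    return sorted(n for n in names
                  if not n.endswith(sentinel_suffix)
                  and n + sentinel_suffix in names)
```

For example, `ready_files(["a.csv", "a.csv.done", "b.csv"])` selects only `a.csv`, since `b.csv` has no sentinel yet.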
CodeCaster
  • Moving files may fail due to permission settings, and the last one is a bit tricky; it's the kind of trick I am trying to avoid. If I simply try to delete the file from the server after losing the stream, should I always get an error from the server if the upload is still in progress? FTP servers are staging areas, so that I can do. – MiguelSlv Feb 09 '17 at 15:21