How to poll a directory and not hit a file-transfer race condition?

Question

I am working on an application that polls a directory for new input files at a defined interval. The general process is:

1. Input files are FTP'd to a landing-strip directory by another app
2. Our app wakes up
3. List files in the input directory
4. Atomic-move the files to a separate staging directory
5. Kick off worker threads (via a work-distributing queue) to consume the files from the staging directory
6. Go back to sleep
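
For illustration, the list and atomic-move steps might look roughly like the Java sketch below. The names `StagingPoller` and `stageNewFiles` are made up for this example, and it assumes the input and staging directories are on the same file system so that `ATOMIC_MOVE` is supported. Note that nothing here checks whether a file has finished arriving, which is where the race comes from.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this example; not taken from the original application.
final class StagingPoller {

    /**
     * Moves every regular file found in inputDir into stagingDir and returns
     * the staged paths. Assumes both directories share a file system;
     * otherwise ATOMIC_MOVE throws AtomicMoveNotSupportedException.
     */
    public static List<Path> stageNewFiles(Path inputDir, Path stagingDir) throws IOException {
        List<Path> staged = new ArrayList<>();
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(inputDir)) {
            for (Path file : listing) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and other non-file entries
                }
                Path target = stagingDir.resolve(file.getFileName());
                // The move is atomic, but the file may still be mid-transfer.
                Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
                staged.add(target);
            }
        }
        return staged;
    }
}
```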

I've uncovered a problem where the app will pick up an input file while it is incomplete and still in the middle of being transferred, resulting in a worker thread error, requiring manual intervention. This is a scenario we need to avoid.

I should note that the file transfer does complete successfully and the server ends up with a complete copy, but only after the app has already given up with an error.

I'd like to solve this in a clean way, and while I have some ideas for solutions, they all have problems I don't like.

Here's what I've considered:

1. Force the other apps (some of which are external to our company) to initially transfer the input files to a holding directory, then atomic-move them into the input directory once they're transferred. This is the most robust idea I've had, but I don't like it because I don't trust that it will always be implemented correctly.
2. Retry a finite number of times on error. I don't like this because it's only a partial solution: it makes assumptions about transfer time and file size that could be violated. It would also blur the line between a genuinely bad file and one that has just been incompletely transferred.
3. Watch the file sizes and only pick up a file if its size hasn't changed for a defined period of time. I don't like this because it's too complex in our environment: the poller is a non-concurrent clustered Quartz job, so I can't just keep this info in memory because the job can bounce between servers. I could store it in the jobdetail, but this solution just feels too complicated.
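
As a rough sketch of a stateless variant of the last idea above (an assumption for illustration, not something proposed in this thread): instead of remembering sizes between runs, the poller could compare a file's last-modified time with the current time and skip anything modified too recently. `QuietPeriodCheck` and `isQuiet` are hypothetical names, and the check still cannot tell a stalled transfer from a finished one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for this example; assumes file timestamps and the
// poller's clock are reasonably in sync.
final class QuietPeriodCheck {

    /** Returns true if the file was last modified at least quietPeriod before now. */
    public static boolean isQuiet(Path file, Duration quietPeriod, Instant now) throws IOException {
        Instant lastModified = Files.getLastModifiedTime(file).toInstant();
        // Quiet means lastModified + quietPeriod is not later than now.
        return !lastModified.plus(quietPeriod).isAfter(now);
    }
}
```

Because the timestamp lives in the file system, nothing needs to be stored between job runs, but the quiet period is still a guess about transfer behaviour.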

I can't be the first to have encountered this problem, so I'm sure I'll get better ideas here.

I don't know whether Java's file/directory update events, or the filesystem watchers included in Apache Camel, behave any differently. But I think that if you try to open the file exclusively before copying it and this fails, you know that the writing process is not finished. Have a look here: http://stackoverflow.com/questions/128038/how-can-i-lock-a-file-using-java-if-possible — Marged, Jul 02 '15 at 21:15
Do you really have to poll? Why not use the [java nio watch service API](https://docs.oracle.com/javase/tutorial/essential/io/notification.html) instead? — Alp, Jul 02 '15 at 21:29
@Alp There's no event that would tell you another process is done writing a file. — erickson, Jul 02 '15 at 21:40
@Marged This is a good solution if the writing process uses file locks. The lock mechanism is only advisory; an application is guaranteed to detect a lock if it's present, but it's not guaranteed to prevent concurrent modification. — erickson, Jul 02 '15 at 21:42
@erickson - It's not a good solution because there's no way to tell whether the file transfer completed and did not fail. The only one who knows that the transfer was successful is the sender, so there needs to be some signal sent to the receiving end that the transfer completed successfully. — Andrew Henle, Jul 03 '15 at 18:41
@AndrewHenle Do you mean that there needs to be a signal to ensure the file transmission was not interrupted (in contrast with a signal from the server process receiving and writing the file to disk, and the consumer process, reading the file from disk)? Yes, that's true, but most network protocols use something like a FIN packet (cleanly closed TCP stream) or a TLS close alert to give sufficient assurance that the content was not truncated. The internal file format may provide an extra measure of confidence. — erickson, Jul 03 '15 at 19:18
@erickson True, but if all you're doing is using an external process to poll a directory there's no way to tell that the FTP or SSH daemon, for example, stopped writing to the file because the connection was lost. The use of an internal file format to indicate an incomplete transfer is just a way of conveying the required information that the transfer was complete. It doesn't matter how that information is conveyed, but to convey the information from a FIN packet for example requires some type of integration of file detection and processing into the application that actually receives the file. — Andrew Henle, Jul 03 '15 at 19:30
Using the file format to determine if the file is complete cannot tell you if it is still being transferred, since you could have a complete transfer of an incomplete file. I think you need an out-of-band signal to differentiate these cases, which a FIN packet sort of is, since it's not part of the application byte stream. — Kaypro II, Jul 08 '15 at 21:50
@KayproII did you ever get a solution to this? I came across this today with the same issue and wondering how you managed to solve the problem? — Ryan Mortier, Nov 07 '18 at 13:22
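
A minimal Java sketch of the lock probe suggested in the comments above, using `FileChannel.tryLock()`. `LockProbe` and `canLockExclusively` are hypothetical names. As noted in the comments, the lock is only advisory: a `true` result means no cooperating process holds a lock right now, not that the transfer finished or succeeded.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for this example; only useful if the writing process
// actually takes a lock on the file while it writes.
final class LockProbe {

    /** Returns true if an exclusive lock on the file could be acquired and released. */
    public static boolean canLockExclusively(Path file) throws IOException {
        // An exclusive lock requires a channel opened for writing.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false; // another process holds an overlapping lock
            }
            lock.release();
            return true;
        } catch (OverlappingFileLockException e) {
            return false; // this JVM already holds a lock on the file
        }
    }
}
```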

score 3 · Answer 1 · answered Jul 02 '15 at 21:15

I had that situation once. We got the other guys to upload the files with a different extension, e.g. *.tmp, and then, once the file copy had completed, rename each file to the extension that my code was polling for. I'm not sure whether that is as easily done when the files are coming in by FTP, though.

Les Ferguson

The fundamental problem is there is **no** way to unequivocally determine that the file transfer is complete and completely successful without some sort of signal from the sender. Renaming the file after it's been transferred is such a signal, and it's probably the easiest to implement. – Andrew Henle Jul 03 '15 at 18:46
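
A small Java sketch of the rename convention described in this answer, under the assumption that the sender can write to the input directory directly. `TmpRenameConvention`, `publish`, and `listCompleted` are hypothetical names, and the `.tmp` suffix is just the example extension from the answer.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this example; both sides must agree on the suffix.
final class TmpRenameConvention {

    /** Sender side: write under a .tmp name, then rename to the final name. */
    public static Path publish(Path inputDir, String finalName, byte[] content) throws IOException {
        Path temp = inputDir.resolve(finalName + ".tmp");
        Files.write(temp, content);
        // The rename is the "transfer complete" signal for the poller.
        return Files.move(temp, inputDir.resolve(finalName), StandardCopyOption.ATOMIC_MOVE);
    }

    /** Receiver side: list only regular files that do not carry the .tmp suffix. */
    public static List<Path> listCompleted(Path inputDir) throws IOException {
        List<Path> completed = new ArrayList<>();
        try (DirectoryStream<Path> listing = Files.newDirectoryStream(inputDir)) {
            for (Path file : listing) {
                String name = file.getFileName().toString();
                if (Files.isRegularFile(file) && !name.endsWith(".tmp")) {
                    completed.add(file);
                }
            }
        }
        return completed;
    }
}
```

Over FTP, the sender would issue the rename through the protocol's rename commands (RNFR/RNTO) rather than `Files.move`; whether that rename is atomic depends on the server.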
