I have some Python code that goes roughly like this, using some libraries that you may or may not have:
# Open it for writing
vcf_file = open(local_filename, "w")
# Download the region to the file.
subprocess.check_call(["bcftools", "view",
    options.truth_url.format(sample_name), "-r",
    "{}:{}-{}".format(ref_name, ref_start, ref_end)], stdout=vcf_file)
# Close parent process's copy of the file object
vcf_file.close()
# Upload it
file_id = job.fileStore.writeGlobalFile(local_filename)
Basically, I'm starting a subprocess that's supposed to go download some data for me and print it to standard out. I'm redirecting that data to a file, and then, as soon as the subprocess call returns, I'm closing my handle to the file and then copying the file elsewhere.
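For anyone who wants to try this without bcftools or the job's file store, here is a minimal, self-contained sketch of the same pattern. The helper name run_to_file and the child command are hypothetical stand-ins (a small Python one-liner plays the part of bcftools), so this only reproduces the shape of my code, not the actual download.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(command, path):
    """Run command with stdout redirected to path, then report the size the parent sees."""
    # Open the destination for writing and hand it to the child as its stdout.
    with open(path, "wb") as out:
        subprocess.check_call(command, stdout=out)
    # The parent's handle is closed here; look at the file the way a copy would.
    return os.path.getsize(path)


def stand_in_child(payload_size):
    """Hypothetical replacement for the bcftools call: print payload_size bytes and exit."""
    return [sys.executable, "-c",
            "import sys; sys.stdout.write('x' * {})".format(payload_size)]


if __name__ == "__main__":
    expected = 1000000
    with tempfile.TemporaryDirectory() as tmp:
        seen = run_to_file(stand_in_child(expected), os.path.join(tmp, "out.txt"))
        print("expected {} bytes, parent sees {}".format(expected, seen))
```

Comparing the expected size with what the parent sees right after check_call() returns is the check I care about; in my real code the "expected" side is whatever bcftools should have produced for the region.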
I'm observing that, sometimes, the tail end of the data I'm expecting isn't making it into the copy. Now, it's possible that bcftools is just occasionally not writing that data, but I'm worried that I might be doing something unsafe and somehow getting access to the file after subprocess.check_call() has returned, but before the data that the child process writes to standard output makes it onto the disk where I can see it.
Looking at the C++ standard (since bcftools is implemented in C/C++), it looks like when a program exits normally, all open C streams (including standard output) are flushed and closed. See the [lib.support.start.term] section here, describing the behavior of exit(), which is called implicitly when main() returns:
--Next, all open C streams (as mediated by the function signatures declared in <cstdio>) with unwritten buffered data are flushed, all open C streams are closed, and all files created by calling tmpfile() are removed.
--Finally, control is returned to the host environment. If status is zero or EXIT_SUCCESS, an implementation-defined form of the status successful termination is returned. If status is EXIT_FAILURE, an implementation-defined form of the status unsuccessful termination is returned. Otherwise the status returned is implementation-defined.
So before the child process exits, it closes (and thus flushes) standard output.
However, the manual page for Linux close(2) notes that closing a file descriptor does not necessarily guarantee that any data written to it has actually made it to disk:
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.)
Thus, it would appear that, when a process exits, its standard output stream is flushed, but if that stream is actually backed by a file descriptor pointing to a file on disk, the write to disk is not guaranteed to have completed. I suspect that that may be what is going on here.
So, my actual questions:
Is my reading of the specs correct? Can a child process appear to its parent to have terminated before its redirected standard output is available on disk?
Is it possible to somehow wait until all data written by the child process to files has actually been synced to disk by the OS?
Should I be calling flush() or some Python version of fsync() on the parent process's copy of the file object? Can that force writes to the same file descriptor by child processes to be committed to disk?
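To make the last question concrete, this is roughly what I have in mind: a sketch that calls os.fsync() on the parent's descriptor after the child has exited and before the file is closed and copied. The helper name run_and_sync is hypothetical, and the child command in the usage line is a stand-in for bcftools. As far as I understand fsync(2), it applies to the file the descriptor refers to rather than only to writes made through that particular descriptor, but whether this actually matters for the truncation I'm seeing is exactly what I'm unsure about.

```python
import os
import subprocess
import sys
import tempfile


def run_and_sync(command, path):
    """Run command with stdout redirected to path, then fsync before closing."""
    with open(path, "wb") as out:
        subprocess.check_call(command, stdout=out)
        # The parent never wrote through `out`, so its own buffer should be
        # empty; flush() is included only for completeness.
        out.flush()
        # Ask the OS to commit the file's data to the storage device.
        os.fsync(out.fileno())
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical stand-in for the bcftools invocation.
    child = [sys.executable, "-c", "import sys; sys.stdout.write('y' * 4096)"]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_and_sync(child, os.path.join(tmp, "region.vcf")))
```

In the real code, the writeGlobalFile() upload would come after this function returns.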