If we need to read/write some data from/to a large file each time before/after processing, which of the following ways (with some demonstration Python code) is better?
1. Open the file each time we need to read or write, and close it immediately afterwards. This seems to be safer, but slower, since the file is opened and closed many times.

    for i in processing_loop:
        with open(datafile) as f:
            read_data(...)
        process_data(...)
        with open(resultfile, 'a') as f:
            save_data(...)

This looks awkward, but MATLAB seems to take this approach in its .mat file I/O functions `load` and `save`: we call `load` and `save` directly, without an explicit `open` or `close`.

2. Open the file once and close it only after all the work is finished. This is faster, but risks leaving the file open if the program raises an error, or corrupting the file if the program is terminated unexpectedly.

    fr = open(datafile)
    fw = open(resultfile, 'a')
    for i in processing_loop:
        read_data(...)
        process_data(...)
        save_data(...)
    fr.close()
    fw.close()

In fact, I had several `hdf5` files corrupted this way when the program was killed.
It seems people prefer the second way, wrapping the loop in a `with` block

    with open(...) as f:
        ...

or in an exception-handling block.
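To make the `with` variant concrete, here is a minimal runnable sketch. The file names, the `run` function, and the integer-doubling "processing" are hypothetical stand-ins for `read_data`, `process_data`, and `save_data`. It shows that when the loop raises an exception, both files are still closed and the results written so far are flushed to disk.

```python
import os
import tempfile

def run(datafile, resultfile):
    # Both files stay open for the whole loop; the with statement closes them
    # when the block exits, whether normally or through an exception.
    with open(datafile) as fr, open(resultfile, "a") as fw:
        for line in fr:
            value = int(line)           # stand-in for read_data; raises ValueError on bad input
            fw.write(f"{value * 2}\n")  # stand-in for process_data + save_data

# Hypothetical input containing one bad record to trigger an error mid-loop.
tmpdir = tempfile.mkdtemp()
datafile = os.path.join(tmpdir, "data.txt")
resultfile = os.path.join(tmpdir, "result.txt")
with open(datafile, "w") as f:
    f.write("1\n2\noops\n4\n")

try:
    run(datafile, resultfile)
except ValueError:
    pass  # by now the with block has already closed both files

with open(resultfile) as f:
    saved = f.read().splitlines()
print(saved)
```

Note that this only helps when Python gets the chance to unwind the stack. If the process is killed outright or the machine goes down, no cleanup code runs at all, which matches the corruption described below.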
I knew about both of these and did use them, but my `hdf5` files were still corrupted when the program was killed.
Once, I was trying to write a huge array into an hdf5 file and the program was stuck for a long time, so I killed it, and the file was corrupted.
Often, the program is terminated because the server suddenly goes down or the running time exceeds the wall-time limit.
I didn't pay attention to whether the corruption occurs only when the program is terminated while it is writing data to the file. If so, it means the file structure is corrupted because it is incomplete. So I wonder whether it would help to flush the data after every write, which increases the I/O load but could reduce the chance that the program is in the middle of a write when it is terminated.
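For reference, flushing after every write could look like the sketch below. It uses a plain text file rather than HDF5 so that it runs with the standard library only, and `append_durably` and the file name are hypothetical. `flush()` empties Python's own buffer into the operating system, and `os.fsync()` asks the operating system to commit its cache to the storage device.

```python
import os
import tempfile

def append_durably(f, text):
    f.write(text)
    f.flush()              # push Python's userspace buffer to the OS
    os.fsync(f.fileno())   # ask the OS to commit its cache to the storage device

path = os.path.join(tempfile.mkdtemp(), "result.txt")
with open(path, "a") as fw:
    for i in range(3):
        append_durably(fw, f"{i}\n")
        # A second handle already sees the record while fw is still open.
        with open(path) as check:
            visible = check.read()
print(repr(visible))
```

This narrows the window in which buffered data can be lost, but it probably does not remove the risk: a kill that lands in the middle of a write can still leave a partial record, and an HDF5 file has internal metadata that is more fragile than an append-only text file. As far as I know, h5py file objects also offer a `flush()` method that could be called in the same place.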
I tried the first way, accessing the file only when reading or writing data is necessary, but the speed obviously slowed down. What happens in the background when we open or close a file handle? Isn't it just creating or destroying a pointer? Why do `open`/`close` operations cost so much?
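To put a number on the slowdown, here is a rough micro-benchmark sketch comparing the two ways on a plain text file. The file name and the write count `N` are arbitrary choices. As I understand it, each `open` is a system call that resolves the path, checks permissions, and sets up a file descriptor and buffer, and each `close` flushes that buffer, so reopening per write pays this cost `N` times instead of once. For a format like HDF5, opening presumably also involves reading the file's internal metadata, which would add to the cost.

```python
import os
import tempfile
import timeit

path = os.path.join(tempfile.mkdtemp(), "bench.txt")
N = 2000  # arbitrary number of writes for the comparison

def reopen_each_time():
    for i in range(N):
        with open(path, "a") as f:   # one open() and one close() per write
            f.write("x\n")

def keep_open():
    with open(path, "a") as f:       # a single open()/close() pair
        for i in range(N):
            f.write("x\n")

t_reopen = timeit.timeit(reopen_each_time, number=1)
t_keep = timeit.timeit(keep_open, number=1)
print(f"reopen each time: {t_reopen:.4f} s, keep open: {t_keep:.4f} s")
```

The absolute timings depend heavily on the file system (local disk versus a network mount on a cluster), so the numbers are only meaningful when measured on the machine where the real job runs.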