Recently I ran some R code that wrote out a TSV table, and only afterwards realized that I had set the p-value threshold too generously, resulting in a ~300 GB table. Not wanting to load the entire file into a dataframe just to filter it, I wrote the following Python script to filter the file line by line.
infile = 'table.tsv' # ~ 300 GB
outfile = 'table.filtered.tsv'

from itertools import islice

with open(infile, 'r') as inf:
    with open(outfile, 'a') as outf:
        for line in islice(inf, 1, None): # starting from the second line
            if float(line.split('\t')[4]) <= 1e-7:
                outf.write(line)
It runs just fine for a while, then simply stops writing for no apparent reason after 7.6 GB of output. Several runs gave the same result. I ran it on a remote Ubuntu server with ample disk space.
I tried adding outf.flush(), and it had no effect. I then added print(line), and the script was still printing table contents after it had stopped writing output, so the problem is neither with the islice() function nor with a failure to continue to the next iteration of the loop.
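For reference, this is roughly what the loop looked like with those debugging additions (a sketch from memory, not the exact code; the placement of the flush() and print() calls may have differed slightly):

infile = 'table.tsv'            # same paths as above
outfile = 'table.filtered.tsv'

from itertools import islice

with open(infile, 'r') as inf:
    with open(outfile, 'a') as outf:
        for line in islice(inf, 1, None):
            print(line)          # confirms the loop keeps iterating over the input
            if float(line.split('\t')[4]) <= 1e-7:
                outf.write(line)
                outf.flush()     # attempt to force buffered data to disk after each write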
I wasn't able to find any documentation about this behavior, except for a similar question on Stack Overflow. In fact, I did the exact thing that this answer recommended.
What might be the reason for this behavior?