
Recently I ran some R code that wrote its results to a TSV table, and only then realized that I had set the p-value threshold too generously, resulting in a ~300 GB table. Not wanting to load the entire file into a dataframe just to filter it, I wrote the following Python script to filter the file line by line.

from itertools import islice

infile = 'table.tsv'                 # ~300 GB
outfile = 'table.filtered.tsv'

with open(infile, 'r') as inf, open(outfile, 'a') as outf:
    for line in islice(inf, 1, None):           # start from the second line (skip the header)
        if float(line.split('\t')[4]) <= 1e-7:  # keep rows whose p-value (5th column) passes the threshold
            outf.write(line)

It runs just fine for a while, then simply stops writing, for no apparent reason, after about 7.6 GB of output. Several runs gave the same result. I ran it on a remote Ubuntu server with ample disk space.

I tried adding `outf.flush()`, which had no effect. I then added `print(line)`, and the script was still printing table contents after it had stopped writing output, so the problem is neither with `islice()` nor with the loop failing to advance to the next iteration.
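Concretely, the instrumented loop looked roughly like this (a sketch of the additions described above, not a verbatim copy):

from itertools import islice

with open('table.tsv', 'r') as inf, open('table.filtered.tsv', 'a') as outf:
    for line in islice(inf, 1, None):            # skip the header line
        print(line)                              # still prints rows after the output stops growing
        if float(line.split('\t')[4]) <= 1e-7:
            outf.write(line)
            outf.flush()                         # made no difference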

I wasn't able to find any documentation about this behaviour, except for a similar question on Stack Overflow. In fact, I did exactly what that answer recommended.

What might be the reason for this behavior?

    How large does the file get in *bytes*? What values do the `write` calls return when the file doesn't grow anymore? – Kelly Bundy Mar 03 '23 at 10:33
  • The input table is 315506567969 bytes, the output one is 8142586140 bytes. The 10 last returned values from write() are: 109 104 114 107 113 105 108 105 108 108. So, stops at 108. – BolbatAV Mar 03 '23 at 10:44
  • Are you sure the `if` condition is evaluating as true? Maybe there's a big chunk of "bad" rows that your code is just skipping? Putting some output in the `else` might be a good thing to test (or just removing the conditional till you know you have no issues with writing)? – Amadan Mar 03 '23 at 10:46
  • Does the Ubuntu server have 8 GB of RAM max? The link given by @BolbatAV indicates that, depending on the implementation of the file IO, changes may be kept in RAM until `close()` is called. Maybe you need to write in batches, a few GB at a time. – SpaceBurger Mar 03 '23 at 10:48
  • [This link](https://stackoverflow.com/questions/34339272/python-size-limitations-on-writing-to-a-file) also suggests that a bad character may stop the engine from writing to the file. Maybe make sure to set appropriate encoding values. – SpaceBurger Mar 03 '23 at 10:52
  • @Amadan Thanks for the advice, I've found the problem now. You won't believe how embarrassed I feel once you read what it was. -_- – BolbatAV Mar 03 '23 at 10:57
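A sketch of the check Amadan suggests (my own illustration, reusing the file name and column index from the question): count how many lines pass the filter and how many do not, which makes it obvious whether writing has stopped or the condition has simply stopped matching.

from itertools import islice

matched = skipped = 0
with open('table.tsv', 'r') as inf:
    for line in islice(inf, 1, None):            # skip the header line
        if float(line.split('\t')[4]) <= 1e-7:
            matched += 1
        else:
            skipped += 1
print(f'{matched} lines pass the filter, {skipped} do not')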

1 Answer


Well, after some more poking around the `if` statement, the problem turned out to be unexpectedly simple. The package I used to generate the input table (MatrixEQTL, if you're wondering) writes its output already sorted by p-value, which is counterintuitive and never actually mentioned in the documentation. The code works; it simply ran out of values small enough to pass the filter. Thanks everyone for the advice.
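Since the rows are already sorted by p-value, a follow-up sketch (my own, assuming ascending order and the same column layout as in the question): the loop can stop as soon as the threshold is exceeded instead of scanning the remaining hundreds of gigabytes.

from itertools import islice

with open('table.tsv', 'r') as inf, open('table.filtered.tsv', 'w') as outf:
    for line in islice(inf, 1, None):            # skip the header line
        if float(line.split('\t')[4]) > 1e-7:    # sorted ascending, so no later row can pass
            break
        outf.write(line)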

  • Even if the file *wasn't* sorted, you'd get the same amount of output; it would just take longer. The problem here is that you didn't have an accurate idea of either what your input was or what your output *should* be. – chepner Mar 03 '23 at 14:45