
StackOverflow,

I am working on a genomics project involving some large files (10-50Gb) that I want to read into Python 2.7 for processing. I do not need the entire file in memory; I just want to read each file line by line, do a small task, and move on.

I found similar SO questions and tried to implement a few of their solutions:

Efficient reading of 800 GB XML file in Python 2.7

How to read large file, line by line in python

I ran each of the following scripts on a 17Gb file:

SCRIPT 1 (itertools):

#!/usr/bin/env python2

import sys
import string
import os
import itertools

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 2 (fileinput):

#!/usr/bin/env python2

import sys
import string
import os
import fileinput 

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 3 (for line):

#!/usr/bin/env python2

import sys
import string
import os

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))

SCRIPT 4 (yield):

#!/usr/bin/env python2

import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":

    #Read in PosList
    posList=[]
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))

For the 17Gb file, sys.getsizeof() reports the final list as ~5Gb, but according to top each script uses upwards of 43Gb of memory.

My question is: why does memory usage climb so far above the size of either the input file or the final list? If the final list is only 5Gb and the 17Gb file is being read line by line, why does each script hit ~43Gb? Is there a better way to read in large files without memory leaks (if that's what they are)?

Many thanks.

EDIT:

Output from '/usr/bin/time -v python script3.py':

Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Output from top:

15816   user    20  0   727m    609m    2032    R   76.8    0.5 0:02.31 python
15816   user    20  0   1541m   1.4g    2032    R   99.6    1.1 0:05.31 python
15816   user    20  0   2362m   2.2g    2032    R   99.6    1.7 0:08.31 python
15816   user    20  0   3194m   3.0g    2032    R   99.6    2.4 0:11.31 python
15816   user    20  0   4014m   3.8g    2032    R   99.6    3   0:14.31 python
15816   user    20  0   4795m   4.6g    2032    R   99.6    3.6 0:17.31 python
15816   user    20  0   5653m   5.3g    2032    R   99.6    4.2 0:20.31 python
15816   user    20  0   6457m   6.1g    2032    R   99.3    4.9 0:23.30 python
15816   user    20  0   7260m   6.9g    2032    R   99.6    5.5 0:26.30 python
15816   user    20  0   8085m   7.7g    2032    R   99.9    6.1 0:29.31 python
15816   user    20  0   8809m   8.5g    2032    R   99.6    6.7 0:32.31 python
15816   user    20  0   9645m   9.3g    2032    R   99.3    7.4 0:35.30 python
15816   user    20  0   10.3g   10g 2032    R   99.6    8   0:38.30 python
15816   user    20  0   11.1g   10g 2032    R   100 8.6 0:41.31 python
15816   user    20  0   11.8g   11g 2032    R   99.9    9.2 0:44.32 python
15816   user    20  0   12.7g   12g 2032    R   99.3    9.9 0:47.31 python
15816   user    20  0   13.4g   13g 2032    R   99.6    10.5    0:50.31 python
15816   user    20  0   14.3g   14g 2032    R   99.9    11.1    0:53.32 python
15816   user    20  0   15.0g   14g 2032    R   99.3    11.7    0:56.31 python
15816   user    20  0   15.9g   15g 2032    R   99.9    12.4    0:59.32 python
15816   user    20  0   16.6g   16g 2032    R   99.6    13  1:02.32 python
15816   user    20  0   17.3g   17g 2032    R   99.6    13.6    1:05.32 python
15816   user    20  0   18.2g   17g 2032    R   99.9    14.2    1:08.33 python
15816   user    20  0   18.9g   18g 2032    R   99.6    14.9    1:11.33 python
15816   user    20  0   19.9g   19g 2032    R   100 15.5    1:14.34 python
15816   user    20  0   20.6g   20g 2032    R   99.3    16.1    1:17.33 python
15816   user    20  0   21.3g   21g 2032    R   99.6    16.7    1:20.33 python
15816   user    20  0   22.3g   21g 2032    R   99.9    17.4    1:23.34 python
15816   user    20  0   23.0g   22g 2032    R   99.6    18  1:26.34 python
15816   user    20  0   23.7g   23g 2032    R   99.6    18.6    1:29.34 python
15816   user    20  0   24.4g   24g 2032    R   99.6    19.2    1:32.34 python
15816   user    20  0   25.4g   25g 2032    R   99.3    19.9    1:35.33 python
15816   user    20  0   26.1g   25g 2032    R   99.9    20.5    1:38.34 python
15816   user    20  0   26.8g   26g 2032    R   99.9    21.1    1:41.35 python
15816   user    20  0   27.4g   27g 2032    R   99.6    21.7    1:44.35 python
15816   user    20  0   28.5g   28g 2032    R   99.6    22.3    1:47.35 python
15816   user    20  0   29.2g   28g 2032    R   99.9    22.9    1:50.36 python
15816   user    20  0   29.9g   29g 2032    R   99.6    23.5    1:53.36 python
15816   user    20  0   30.5g   30g 2032    R   99.6    24.1    1:56.36 python
15816   user    20  0   31.6g   31g 2032    R   99.6    24.7    1:59.36 python
15816   user    20  0   32.3g   31g 2032    R   100 25.3    2:02.37 python
15816   user    20  0   33.0g   32g 2032    R   99.6    25.9    2:05.37 python
15816   user    20  0   33.7g   33g 2032    R   99.6    26.5    2:08.37 python
15816   user    20  0   34.3g   34g 2032    R   99.6    27.1    2:11.37 python
15816   user    20  0   35.5g   34g 2032    R   99.6    27.7    2:14.37 python
15816   user    20  0   36.2g   35g 2032    R   99.6    28.4    2:17.37 python
15816   user    20  0   36.9g   36g 2032    R   100 29  2:20.38 python
15816   user    20  0   37.5g   37g 2032    R   99.6    29.6    2:23.38 python
15816   user    20  0   38.2g   38g 2032    R   99.6    30.2    2:26.38 python
15816   user    20  0   38.9g   38g 2032    R   99.6    30.8    2:29.38 python
15816   user    20  0   40.1g   39g 2032    R   100 31.4    2:32.39 python
15816   user    20  0   40.8g   40g 2032    R   99.6    32  2:35.39 python
15816   user    20  0   41.5g   41g 2032    R   99.6    32.6    2:38.39 python
15816   user    20  0   42.2g   41g 2032    R   99.9    33.2    2:41.40 python
15816   user    20  0   42.8g   42g 2032    R   99.6    33.8    2:44.40 python
15816   user    20  0   43.4g   43g 2032    R   99.6    34.3    2:47.40 python
15816   user    20  0   43.4g   43g 2032    R   100 34.3    2:50.41 python
15816   user    20  0   38.6g   38g 2032    R   100 30.5    2:53.43 python
15816   user    20  0   24.9g   24g 2032    R   99.7    19.6    2:56.43 python
15816   user    20  0   12.0g   11g 2032    R   100 9.4 2:59.44 python

EDIT 2:

For further clarification, here is an expanded description of the issue. What I'm doing here is reading in a list of positions from a FASTA file (Contig1/1, Contig1/2, etc.). That list is converted to a dictionary full of N's via:

keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))

Then, I'm reading in pileup files for multiple species, again line-by-line (where the same problem will exist), and getting the final base call via:

with open (path+'/'+os.path.basename(path)+'.pileups',"r") as filein:
    for line in iter(filein):
        splitline=line.split()
        if len(splitline)>4:
            node,pos,ref,num,bases,qual=line.split()
            loc=node+'/'+pos
            cleanBases=getCleanList(ref,bases)
            finalBase=getFinalBase_Pruned(cleanBases,minread,thresh)
            speciesDict[loc] = finalBase

Because the species-specific pileup files are not the same length or in the same order, I am building the position list as a 'common-garden' framework for storing the individual species data. If no data is available at a given site for a species, that site gets an 'N' call. Otherwise, a base is assigned to the site in the dictionary.

The end result is an ordered, complete file for each species, from which I can do downstream analyses.
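Roughly, that final write step looks something like this (simplified; the output filename here is just illustrative):

# Simplified output step: posList preserves the reference order, so writing
# the dictionary values in that order yields an ordered, complete file.
outPath = os.path.basename(path) + '.calls'   # illustrative output name
with open(outPath, 'w') as out:
    for loc in posList:
        out.write(speciesDict[loc] + '\n')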

Because the line-by-line reading is eating up so much memory, reading in TWO large files will overload my resources, even though the final data structures are much smaller than the memory I expected to need (the size of the growing lists plus a single line of data being added at a time).

jazz710
  • Can you post the top output? – Klaus D. Jan 22 '18 at 15:48
  • You say you don't want to read the whole file into memory, but you basically are doing so by appending all the stripped lines to `posList` (unless you have lots of leading/trailing whitespace); SCRIPT 3 is the most Pythonic of these, by the way. – Chris_Rands Jan 22 '18 at 16:00
  • There are EOL characters that need to be stripped in this case. For this particular step, you're right in that it doesn't really matter whether I read it all at once or line by line since I'm writing the entire file to a list anyhow. However, in downstream steps, I do need to read them in line-by-line to do some sorting and the same problem exists. I chose this part of the script because it's indicative of my issue, but not burdened by genomics-y things. – jazz710 Jan 22 '18 at 16:05
  • I agree with @Chris_Rands: why do you need to append the file lines to a list at all? Can't you perform whatever operation you want on each line directly? – Gaurang Shah Jan 22 '18 at 16:05
  • Feel free to burden us with genomics things if it helps us understand the real issue; there's a [tag:bioinformatics] tag too – Chris_Rands Jan 22 '18 at 16:08
  • Clarification and bioinformatics tag added, thanks. – jazz710 Jan 22 '18 at 16:33
  • @GaurangShah Thanks for your comment. I have made some clarifications. In a downstream step, I do what you suggest here: read in a line, do an operation, and move on. The issue I'm facing is that the memory required for this seems to be out of sync with my expectations. I believe Python is keeping something in memory that I'm not aware of, because as written I would expect memory usage to be 1) the size of the growing list and 2) the single line being read in. Instead, memory usage goes well above the size of the file and list combined (43Gb to read a 17Gb file and make a 5Gb list). – jazz710 Jan 22 '18 at 16:46
  • For what it's worth, I feel that the issue is that Python is keeping some pointers or something in memory as the file is being read in line by line. When I run the complete script, with the additional steps mentioned in Edit 2, the memory used while reading in the first list (43Gb) is still in use according to top even though the final list is only 5Gb. Then I have to read in a 25Gb file, which eventually saturates my memory allotment (120Gb). If it were simply a matter of the data being too big for my system, I'd relent, but the data is only ~40Gb altogether, so the missing memory is a real issue. – jazz710 Jan 22 '18 at 17:10
  • This... does not seem like the right approach. If the pileup data is from mappings to the same reference sequence, all you would need to do is store the size of each reference sequence, then read the pileup data and print the final base, or a series of Ns whenever you hit a gap. If the reference sequences aren't sorted in the pileup, there are a number of robust tools that can sort the mapping files upstream without blowing up your memory (e.g. samtools or Picard). – heathobrien Jan 22 '18 at 17:18
  • @heathobrien, thank you for your comment. This is part of a project (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0632-y) which requires maintaining certain metadata and imposing certain filtering criteria not implemented as part of pre-existing phylogenetics packages. As such, I do need to process these files outside of the traditional methods. I wanted to pitch this as more of a Python question than a bioinformatics question in order to avoid confusion, but further details were requested from SO. Thanks again. – jazz710 Jan 22 '18 at 17:35
  • Probably nothing to do with your issue, but you could re-use `splitline` here: `node,pos,ref,num,bases,qual=line.split()` -> `node, pos, ref, num, bases, qual = splitline`. – bli Jan 23 '18 at 12:53
  • SCRIPT 1 imports itertools but does not use it. – bli Jan 23 '18 at 12:59

2 Answers


sys.getsizeof(posList) is not telling you what you think it is: it reports the size of the list object itself, which does not include the size of the lines the list contains. Here is the output from reading a roughly 3.5Gb file into a list on my system:

In [2]: lines = []

In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734

In [5]: sys.getsizeof(lines)
Out[5]: 603811872

In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127

In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285

That's a bit over six billion bytes; in top, my interpreter was using about 7.5Gb at that point.

Strings have considerable overhead: 37 bytes each, it looks like:

In [2]: sys.getsizeof('0'*10)
Out[2]: 47

In [3]: sys.getsizeof('0'*100)
Out[3]: 137

In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037

So if your lines are relatively short, a large part of the memory use will be overhead.
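To put those two numbers together, you can estimate the real footprint of a list of lines with something like the sketch below (the helper name is made up, and it ignores string interning and shared references):

import sys

def list_of_strings_footprint(lst):
    # Size of the list object (including its pointer array) plus the size
    # of every string object it references.
    return sys.getsizeof(lst) + sum(sys.getsizeof(s) for s in lst)

# One million unique 10-character strings: roughly 47 MB of string objects
# (37 bytes of overhead + 10 bytes of data each) plus ~8 MB for the list
# itself on a 64-bit Python 2 build.
lines = [str(i).zfill(10) for i in xrange(1000000)]
print(list_of_strings_footprint(lines))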

Nathan Vērzemnieks

While this does not directly address why there is such memory overhead, which @nathan-vērzemnieks answers, a highly memory-efficient solution to your problem may be the Python bitarray package:

https://pypi.python.org/pypi/bitarray

I have used this in the past to store the presence/absence of thousands of DNA motifs in 16S rRNA sequences from over 250,000 species in the SILVA database. It basically encodes each Y/N flag as a single bit, instead of incurring the overhead of storing the character 'Y' or 'N'.
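For example, a minimal sketch of that presence/absence idea (the names are hypothetical, and this covers the flag case rather than your actual A/C/G/T/N base calls):

from bitarray import bitarray   # third-party: pip install bitarray

num_positions = 10000000        # hypothetical number of reference positions
covered = bitarray(num_positions)
covered.setall(False)           # every position starts as 'absent'

# flip a bit whenever a pileup line provides data for that position
covered[42] = True

# 10 million flags take ~1.25 MB, versus tens of MB of one-character strings
print(covered.count())          # number of positions with data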

Vince