StackOverflow,
I am working on a genomics project that involves some large files (10-50 GB) which I want to process in Python 2.7. I do not need to read an entire file into memory; rather, I simply want to read each file line by line, do a small task, and continue on.
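Conceptually, the pattern I'm after is just this (do_small_task is a placeholder, not real code from my pipeline):

with open("BigFile") as f:
    for line in f:
        do_small_task(line.strip())  # small per-line task, nothing kept in memory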
I found some similar SO questions and tried to implement a few of their solutions:
Efficient reading of 800 GB XML file in Python 2.7
How to read large file, line by line in python
When I run the following code on a 17 GB file:
SCRIPT 1 (itertools):
#!/usr/bin/env python2
import sys
import string
import os
import itertools

if __name__ == "__main__":
    # Read in PosList
    posList = []
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
SCRIPT 2 (fileinput):
#!/usr/bin/env python2
import sys
import string
import os
import fileinput

if __name__ == "__main__":
    # Read in PosList
    posList = []
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
SCRIPT3 (for line):
#!/usr/bin/env python2
import sys
import string
import os

if __name__ == "__main__":
    # Read in PosList
    posList = []
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
SCRIPT4 (yield):
#!/usr/bin/env python2
import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":
    # Read in PosList
    posList = []
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))
From the 17 GB file, the size of the final list in Python is ~5 GB [from sys.getsizeof()], but according to 'top' each script uses upwards of 43 GB of memory.
My question is: why does the memory usage rise so far above both the input file and the final list? If the final list is only 5 GB, and the 17 GB input file is being read line by line, why does the memory use for each script hit ~43 GB? Is there a better way to read in large files without memory leaks (if that's what they are)?
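(For reference: my understanding is that sys.getsizeof() on a list only counts the list object and its pointer array, not the strings it contains, so a rough estimate that also includes the string objects would be something like this untested sketch:)

import sys

total = sys.getsizeof(posList)                    # list object + pointer array only
total += sum(sys.getsizeof(s) for s in posList)   # add each string object
sys.stdout.write(str(total))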
Many thanks.
EDIT:
Output from '/usr/bin/time -v python script3.py':
Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Output from top:
15816 user 20 0 727m 609m 2032 R 76.8 0.5 0:02.31 python
15816 user 20 0 1541m 1.4g 2032 R 99.6 1.1 0:05.31 python
15816 user 20 0 2362m 2.2g 2032 R 99.6 1.7 0:08.31 python
15816 user 20 0 3194m 3.0g 2032 R 99.6 2.4 0:11.31 python
15816 user 20 0 4014m 3.8g 2032 R 99.6 3 0:14.31 python
15816 user 20 0 4795m 4.6g 2032 R 99.6 3.6 0:17.31 python
15816 user 20 0 5653m 5.3g 2032 R 99.6 4.2 0:20.31 python
15816 user 20 0 6457m 6.1g 2032 R 99.3 4.9 0:23.30 python
15816 user 20 0 7260m 6.9g 2032 R 99.6 5.5 0:26.30 python
15816 user 20 0 8085m 7.7g 2032 R 99.9 6.1 0:29.31 python
15816 user 20 0 8809m 8.5g 2032 R 99.6 6.7 0:32.31 python
15816 user 20 0 9645m 9.3g 2032 R 99.3 7.4 0:35.30 python
15816 user 20 0 10.3g 10g 2032 R 99.6 8 0:38.30 python
15816 user 20 0 11.1g 10g 2032 R 100 8.6 0:41.31 python
15816 user 20 0 11.8g 11g 2032 R 99.9 9.2 0:44.32 python
15816 user 20 0 12.7g 12g 2032 R 99.3 9.9 0:47.31 python
15816 user 20 0 13.4g 13g 2032 R 99.6 10.5 0:50.31 python
15816 user 20 0 14.3g 14g 2032 R 99.9 11.1 0:53.32 python
15816 user 20 0 15.0g 14g 2032 R 99.3 11.7 0:56.31 python
15816 user 20 0 15.9g 15g 2032 R 99.9 12.4 0:59.32 python
15816 user 20 0 16.6g 16g 2032 R 99.6 13 1:02.32 python
15816 user 20 0 17.3g 17g 2032 R 99.6 13.6 1:05.32 python
15816 user 20 0 18.2g 17g 2032 R 99.9 14.2 1:08.33 python
15816 user 20 0 18.9g 18g 2032 R 99.6 14.9 1:11.33 python
15816 user 20 0 19.9g 19g 2032 R 100 15.5 1:14.34 python
15816 user 20 0 20.6g 20g 2032 R 99.3 16.1 1:17.33 python
15816 user 20 0 21.3g 21g 2032 R 99.6 16.7 1:20.33 python
15816 user 20 0 22.3g 21g 2032 R 99.9 17.4 1:23.34 python
15816 user 20 0 23.0g 22g 2032 R 99.6 18 1:26.34 python
15816 user 20 0 23.7g 23g 2032 R 99.6 18.6 1:29.34 python
15816 user 20 0 24.4g 24g 2032 R 99.6 19.2 1:32.34 python
15816 user 20 0 25.4g 25g 2032 R 99.3 19.9 1:35.33 python
15816 user 20 0 26.1g 25g 2032 R 99.9 20.5 1:38.34 python
15816 user 20 0 26.8g 26g 2032 R 99.9 21.1 1:41.35 python
15816 user 20 0 27.4g 27g 2032 R 99.6 21.7 1:44.35 python
15816 user 20 0 28.5g 28g 2032 R 99.6 22.3 1:47.35 python
15816 user 20 0 29.2g 28g 2032 R 99.9 22.9 1:50.36 python
15816 user 20 0 29.9g 29g 2032 R 99.6 23.5 1:53.36 python
15816 user 20 0 30.5g 30g 2032 R 99.6 24.1 1:56.36 python
15816 user 20 0 31.6g 31g 2032 R 99.6 24.7 1:59.36 python
15816 user 20 0 32.3g 31g 2032 R 100 25.3 2:02.37 python
15816 user 20 0 33.0g 32g 2032 R 99.6 25.9 2:05.37 python
15816 user 20 0 33.7g 33g 2032 R 99.6 26.5 2:08.37 python
15816 user 20 0 34.3g 34g 2032 R 99.6 27.1 2:11.37 python
15816 user 20 0 35.5g 34g 2032 R 99.6 27.7 2:14.37 python
15816 user 20 0 36.2g 35g 2032 R 99.6 28.4 2:17.37 python
15816 user 20 0 36.9g 36g 2032 R 100 29 2:20.38 python
15816 user 20 0 37.5g 37g 2032 R 99.6 29.6 2:23.38 python
15816 user 20 0 38.2g 38g 2032 R 99.6 30.2 2:26.38 python
15816 user 20 0 38.9g 38g 2032 R 99.6 30.8 2:29.38 python
15816 user 20 0 40.1g 39g 2032 R 100 31.4 2:32.39 python
15816 user 20 0 40.8g 40g 2032 R 99.6 32 2:35.39 python
15816 user 20 0 41.5g 41g 2032 R 99.6 32.6 2:38.39 python
15816 user 20 0 42.2g 41g 2032 R 99.9 33.2 2:41.40 python
15816 user 20 0 42.8g 42g 2032 R 99.6 33.8 2:44.40 python
15816 user 20 0 43.4g 43g 2032 R 99.6 34.3 2:47.40 python
15816 user 20 0 43.4g 43g 2032 R 100 34.3 2:50.41 python
15816 user 20 0 38.6g 38g 2032 R 100 30.5 2:53.43 python
15816 user 20 0 24.9g 24g 2032 R 99.7 19.6 2:56.43 python
15816 user 20 0 12.0g 11g 2032 R 100 9.4 2:59.44 python
EDIT 2:
For further clarification, here is an expansion of the issue. What I'm doing is reading in a list of positions from a FASTA file (Contig1/1, Contig1/2, etc.), which is then converted to a dictionary full of 'N's via:
keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
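(In case it matters: I am aware that zip() builds an intermediate list in Python 2; as far as I know, an equivalent that avoids it would be:)

from itertools import izip
speciesDict = dict(izip(keys, values))  # izip yields pairs lazily instead of building a list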
Then I'm reading in pileup files for multiple species, again line by line (where the same problem will exist), and getting the final base call via:
with open(path + '/' + os.path.basename(path) + '.pileups', "r") as filein:
    for line in iter(filein):
        splitline = line.split()
        if len(splitline) > 4:
            node, pos, ref, num, bases, qual = splitline
            loc = node + '/' + pos
            cleanBases = getCleanList(ref, bases)
            finalBase = getFinalBase_Pruned(cleanBases, minread, thresh)
            speciesDict[loc] = finalBase
Because the species-specific pileup files are not the same length or in the same order, the position list gives me a 'common-garden' way to store the data for each individual species. If no data is available for a given site in a species, that site gets an 'N' call; otherwise, a base is assigned to the site in the dictionary.
The end result is a file for each species that is ordered and complete, from which I can do downstream analyses.
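The write-out step I have in mind looks roughly like this (outName is a placeholder; the ordering comes from posList):

with open(outName, "w") as fileout:
    for loc in posList:  # keep the original position order
        fileout.write(loc + '\t' + speciesDict[loc] + '\n')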
Because the line-by-line reading is eating up so much memory, reading in TWO large files will overload my resources, even though the final data structures are much smaller than the memory I expected to be required (the size of the growing lists plus a single line of data being added at a time).