I'm trying to understand how Python uses memory so that I can estimate how many processes I can run at a time. Right now I process large files on a server with a large amount of RAM (~90-150GB free).
As a test, I ran each step below in Python and then checked htop to see the memory usage.
Step 1: I open a 2.55GB file and read it into a string.
with open(file,'r') as f:
    data=f.read()
Usage is 2686M.
Step 2: I split the string on newlines.
data = data.split('\n')
Usage is 7476M.
Step 3: I keep only every 4th line (two of the three lines I remove are the same length as the line I keep).
data=[data[x] for x in range(0,len(data)) if x%4==1]
Usage is 8543M.
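For reference, the filter in step 3 can also be written as an extended slice. This is a small sketch on made-up data (the `lines` list is hypothetical and stands in for the split file). It only shows that the two forms select the same elements; it makes no claim about which one uses less memory.

```python
# Hypothetical stand-in for data.split('\n'): records of 4 lines each,
# where the line at index 1 of every record is the one to keep.
lines = []
for record in range(3):
    lines.extend(["@id%d" % record, "ACGT%d" % record, "+", "IIII%d" % record])

# The index-based filter from step 3.
kept_by_index = [lines[x] for x in range(0, len(lines)) if x % 4 == 1]

# The same selection as an extended slice: start at 1, take every 4th item.
kept_by_slice = lines[1::4]

print(kept_by_index == kept_by_slice)  # True
print(kept_by_slice)                   # ['ACGT0', 'ACGT1', 'ACGT2']
```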
Step 4: I split this into 20 equal chunks to run through a multiprocessing pool.
l=[]
for b in range(0,len(data),len(data)/40):
    l.append(data[b:b+(len(data)/40)])
Usage is 8621M.
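The chunking in step 4 can be sketched with a small helper. `split_into_chunks` is a hypothetical name, not part of the script in this post. It uses ceiling division so that the number of chunks never exceeds the requested count; stepping by `len(data)/n` can produce extra short chunks when the length is not a multiple of `n`.

```python
def split_into_chunks(seq, n_chunks):
    """Split seq into at most n_chunks consecutive slices of near-equal size."""
    # Ceiling division; max(1, ...) avoids a zero step for an empty input.
    size = max(1, -(-len(seq) // n_chunks))
    return [seq[start:start + size] for start in range(0, len(seq), size)]


data = list(range(10))  # made-up sample data
chunks = split_into_chunks(data, 4)

print(len(chunks))  # 4
print(chunks)       # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Each slice is a new list holding references to the same element objects, so chunking copies pointers rather than the strings themselves.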
Step 5: I delete data. Usage is 8496M.
There are several things that are not making sense to me.
In step 2, why does the memory usage go up so much when I turn the string into a list? I am assuming that the list's containers are much larger than the string container?
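One way to check this assumption on a small scale is to compare `sys.getsizeof` for a single string against the summed sizes of the pieces after splitting. This is a sketch on a made-up sample (`text` and `parts` are hypothetical names). `sys.getsizeof` reports only CPython's per-object size, so the numbers will not match what htop shows.

```python
import sys

# Hypothetical sample: 1000 short lines joined into one string.
text = "\n".join(["ACGTACGTAC"] * 1000)
parts = text.split("\n")

size_as_one_string = sys.getsizeof(text)

# getsizeof on a list counts only the list's own pointer array,
# so the size of every element has to be added separately.
size_as_list = sys.getsizeof(parts) + sum(sys.getsizeof(p) for p in parts)

print("one string:", size_as_one_string)
print("list of pieces:", size_as_list)
```

On CPython every string object carries a fixed header in addition to its characters, so many short strings cost noticeably more than one long string holding the same text.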
In step 3, why doesn't the memory usage shrink significantly? I essentially got rid of 3/4 of the list's elements and at least 2/3 of the data they held, so I would expect it to shrink accordingly. Calling the garbage collector did not make any difference.
Oddly enough, when I assigned the smaller list to another variable instead, it used less memory: usage 6605M.
When I then deleted the old object data, usage was 6059M.
This seems weird to me. Any help on shrinking my memory footprint would be appreciated.
EDIT
Okay, this is making my head hurt. Clearly Python is doing some weird things behind the scenes here... and only Python. I've made the following script to demonstrate this using my original method and the method suggested in the answer below. Numbers are all in GB.
TEST CODE
import os,sys
import psutil
process = psutil.Process(os.getpid())
import time
py_usage=process.memory_info().vms / 1000000000.0
in_file = "14982X16.fastq"
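def totalsize(o):
    # Sum the per-object sizes of the elements plus the list itself.
    size = 0
    for x in o:
        size += sys.getsizeof(x)
    size += sys.getsizeof(o)
    return "Object size:"+str(size/1000000000.0)

def getlines4(f):
    # Lazily yield every 4th line (index 1 of each group of 4).
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()

def method1():
    # Original approach: read the whole file, split it, then filter.
    start=time.time()
    with open(in_file,'rb') as f:
        data = f.read().split("\n")
    data=[data[x] for x in xrange(0,len(data)) if x%4==1]
    return data

def method2():
    # Suggested approach: filter while iterating over the file.
    start=time.time()
    with open(in_file,'rb') as f:
        data2=list(getlines4(f))
    return data2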
print "method1 == method2",method1()==method2()
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data=method1()
print "data from method1 is in memory"
print "method1", totalsize(data)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Nothing in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "data from method2 is in memory"
print "method2", totalsize(data2)
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
print "\nPrepare to have your mind blown even more!"
data=method1()
print "Data from method1 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data2=method2()
print "Data from method1 and method 2 are in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
data==data2
print "Compared the two lists"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data
print "Data from method2 is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
del data2
print "Nothing is in memory"
print "Usage:", (process.memory_info().vms / 1000000000.0) - py_usage
OUTPUT
method1 == method2 True
Nothing in memory
Usage: 0.001798144
data from method1 is in memory
method1 Object size:1.52604683
Usage: 4.552925184
Nothing in memory
Usage: 0.001798144
data from method2 is in memory
method2 Object size:1.534815518
Usage: 1.56932096
Nothing is in memory
Usage: 0.001798144
Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 4.552925184
Data from method1 and method 2 are in memory
Usage: 4.692287488
Compared the two lists
Usage: 4.692287488
Data from method2 is in memory
Usage: 4.56169472
Nothing is in memory
Usage: 0.001798144
For those of you using Python 3, it's pretty similar, except not as bad after the comparison operation...
OUTPUT FROM PYTHON3
method1 == method2 True
Nothing in memory
Usage: 0.004395008000000006
data from method1 is in memory
method1 Object size:1.718523294
Usage: 5.322555392
Nothing in memory
Usage: 0.004395008000000006
data from method2 is in memory
method2 Object size:1.727291982
Usage: 1.872596992
Nothing is in memory
Usage: 0.004395008000000006
Prepare to have your mind blown even more!
Data from method1 is in memory
Usage: 5.322555392
Data from method1 and method 2 are in memory
Usage: 5.461917696
Compared the two lists
Usage: 5.461917696
Data from method2 is in memory
Usage: 2.747633664
Nothing is in memory
Usage: 0.004395008000000006
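To reproduce the comparison without a multi-gigabyte input, here is a scaled-down Python 3 sketch. It swaps `psutil` for the standard library's `tracemalloc`, which tracks peak allocations made by Python itself rather than the process's virtual memory, so its numbers are not comparable to the output above. The temporary FASTQ-like file and the `peak_bytes` helper are made up for illustration.

```python
import os
import tempfile
import tracemalloc


def getlines4(f):
    # Lazily yield every 4th line (index 1 of each group of 4).
    for i, line in enumerate(f):
        if i % 4 == 1:
            yield line.rstrip()


def method1(path):
    # Read everything, split, then filter.
    with open(path, "rb") as f:
        data = f.read().split(b"\n")
    return [data[x] for x in range(len(data)) if x % 4 == 1]


def method2(path):
    # Filter while iterating over the file.
    with open(path, "rb") as f:
        return list(getlines4(f))


def peak_bytes(func, path):
    # Hypothetical helper: run func under tracemalloc and report its peak.
    tracemalloc.start()
    result = func(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak


# Build a small made-up FASTQ-like file: 4 lines per record.
with tempfile.NamedTemporaryFile("wb", suffix=".fastq", delete=False) as tmp:
    for i in range(20000):
        tmp.write(b"@read%d\n" % i + b"ACGT" * 25 + b"\n+\n" + b"I" * 100 + b"\n")
    path = tmp.name

try:
    result1, peak1 = peak_bytes(method1, path)
    result2, peak2 = peak_bytes(method2, path)
finally:
    os.remove(path)

print("same result:", result1 == result2)
print("peak bytes, method1:", peak1)
print("peak bytes, method2:", peak2)
```

In this sketch `method1` has to hold the whole file and every split line at once before filtering, while `method2` only ever keeps the lines it selects, which is why its traced peak comes out lower.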
Moral of the story... memory for Python appears to be a bit like Camelot for Monty Python... 'tis a very silly place.