I'm trying to run a few scripts analyzing data with Python, and I was quickly surprised by how much RAM they take.
My script reads two columns of integers from a file. It imports them in the following way:
import numpy as N
from sys import argv
infile = argv[1]
data = N.loadtxt(infile, dtype=N.int32)  # infile is the input file
For a file with almost 8 million lines, it takes around 1.5 GB of RAM (at this stage all it does is import the data).
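As a sanity check, the array itself should only need about 60 MiB (roughly 8 million rows * 2 columns * 4 bytes for int32), which you can verify with data.nbytes:
print(data.nbytes)               # raw size of the array's buffer, in bytes
print(data.nbytes / 1024.**2)    # ~61 MiB here, nowhere near 1.5 GB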
I tried running a memory profiler on it, giving me:
Line #    Mem usage    Increment   Line Contents
     5   17.664 MiB    0.000 MiB   @profile
     6                             def func():
     7   17.668 MiB    0.004 MiB       infile = argv[1]
     8  258.980 MiB  241.312 MiB       data = N.loadtxt(infile,dtype=N.int32)
so roughly 250 MiB for the data, far from the 1.5 GB in memory (what is occupying so much space?)
and when I tried halving it by using int16 instead of int32:
Line #    Mem usage    Increment   Line Contents
     5   17.664 MiB    0.000 MiB   @profile
     6                             def func():
     7   17.668 MiB    0.004 MiB       infile = argv[1]
     8  229.387 MiB  211.719 MiB       data = N.loadtxt(infile,dtype=N.int16)
But I'm only saving about a tenth. How come?
I don't know much about memory usage, but is this normal?
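Back-of-the-envelope, assuming roughly 8 million rows of two columns: switching to int16 can only ever save half of the ~61 MiB the int32 array itself needs, i.e. about 30 MiB, and that seems to be exactly the drop I see:
print(8e6 * 2 * 2 / 1024.**2)   # int16 array: ~31 MiB (half of the ~61 MiB above)
print(241.312 - 211.719)        # observed drop in the increment: ~29.6 MiB
# the drop matches the array halving, so the rest of the ~210 MiB increment
# apparently isn't the array itself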
Also, I coded the same thing in C++, storing the data in vector<int> objects, and it only takes about 120 MB of RAM.
To me, Python seems to sweep a lot under the rug when it comes to handling memory. What is it doing that inflates the size of the data so much? Or is it more related to NumPy?
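One thing I did notice (not sure if it's related): a plain Python number object is much bigger than the 4 bytes an int32 needs, so any intermediate pure-Python representation of the file would be huge. A quick check in CPython (exact sizes vary with version and platform):
import sys
print(sys.getsizeof(1234567))   # typically 24-28 bytes for a single int object
# so ~16 million parsed values, plus the list slots pointing at them,
# would already add up to hundreds of MB before the array is even built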
Inspired by the answer below, I'm now importing my data the following way:
import numpy as N
import commands            # Python 2 module; subprocess.getoutput is the Python 3 equivalent
from sys import argv

infile = argv[1]
output = commands.getoutput("wc -l " + infile)   # use the wc Linux command to count lines, i.e. how much memory to allocate
n_lines = int(output.split(" ")[0])              # the first field of the output is the line count
data = N.empty((n_lines, 2), dtype=N.int16)      # allocate the whole array up front
datafile = open(infile)
for count, line in enumerate(datafile):          # read line by line
    data[count] = line.split(" ")                # fill the array (NumPy converts the two strings to int16)
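(If wc isn't available, a pure-Python line count works too; just a sketch of that variant, slightly slower but portable:)
with open(infile) as f:
    n_lines = sum(1 for _ in f)   # count lines without an external command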
It also works very similarly with multiple files:
infiles = argv[1:]
n_lines = sum(int(commands.getoutput("wc -l " + infile).split(" ")[0]) for infile in infiles)
i = 0
data = N.empty((n_lines, 2), dtype=N.int16)
for infile in infiles:
    datafile = open(infile)
    for line in datafile:
        data[i] = line.split(" ")    # same line-by-line fill, with i running across all the files
        i += 1
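For completeness, numpy.fromiter would be another way to avoid big intermediate lists; this is only a sketch (I haven't benchmarked it against the version above) that also skips the wc step, since fromiter grows the array as it reads:
def iter_values(paths):
    # yield every integer in every file, one value at a time
    for path in paths:
        with open(path) as f:
            for line in f:
                for tok in line.split():
                    yield int(tok)

flat = N.fromiter(iter_values(argv[1:]), dtype=N.int16)   # 1-D array of all values
data = flat.reshape(-1, 2)                                # back to two columns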
The culprit seems to have been numpy.loadtxt: after replacing it, my script no longer needs an extravagant amount of memory, and it even runs 2-3 times faster =)