
I'm running a few scripts that analyze data with Python, and I was quickly surprised by how much RAM they take.

My script reads two columns of integers from a file. It imports them in the following way:

import numpy as N
from sys import argv
infile = argv[1]
data = N.loadtxt(infile,dtype=N.int32)  # infile is the input file

For a file with almost 8 million lines, it takes around 1.5 GB of RAM (at this stage all the script does is import the data).

I tried running a memory profiler on it, giving me:

Line #    Mem usage    Increment   Line Contents
     5   17.664 MiB    0.000 MiB   @profile
     6                             def func():
     7   17.668 MiB    0.004 MiB       infile = argv[1]
     8  258.980 MiB  241.312 MiB       data = N.loadtxt(infile,dtype=N.int32)

so about 250 MB for the data, far from the 1.5 GB in memory (what is occupying so much space?),

and when I tried to halve it by using int16 instead of int32:

Line #    Mem usage    Increment   Line Contents
     5   17.664 MiB    0.000 MiB   @profile
     6                             def func():
     7   17.668 MiB    0.004 MiB       infile = argv[1]
     8  229.387 MiB  211.719 MiB       data = N.loadtxt(infile,dtype=N.int16)

But I'm only saving about a tenth. How come?

I don't know much about memory management, but is this normal?
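
For reference, here is a quick back-of-the-envelope check of what the final array alone should take (my own arithmetic from the 8-million-line figure above, not profiler output):

import numpy as N

rows, cols = 8 * 10**6, 2
print(rows * cols * N.dtype(N.int32).itemsize)  # 64000000 bytes, roughly 61 MiB
print(rows * cols * N.dtype(N.int16).itemsize)  # 32000000 bytes, roughly 31 MiB

So the raw array itself should be nowhere near 1.5 GB; whatever takes the rest must be overhead on top of it.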

Also, I coded the same thing in C++, storing the data in `vector<int>` objects, and it only takes 120 MB of RAM.

To me, Python seems to sweep a lot under the rug when it comes to handling memory. What is it doing that inflates the data so much? Or is it more related to NumPy?

Inspired by the answer below, I'm now importing my data in the following way:

import commands  # Python 2 standard library module used to run shell commands

infile = argv[1]
output = commands.getoutput("wc -l " + infile)  # use the Linux wc command to count the lines, i.e. how many rows I need to allocate
n_lines = int(output.split(" ")[0])  # the first field of the output is the line count
data = N.empty((n_lines, 2), dtype=N.int16)  # allocate the array up front
datafile = open(infile)
for count, line in enumerate(datafile):  # read line by line
    data[count] = line.split(" ")  # fill the array row by row

It also works very similarly with multiple files:

infiles = argv[1:]
n_lines = sum(int(commands.getoutput("wc -l " + infile).split(" ")[0]) for infile in infiles)
i = 0
data = N.empty((n_lines, 2), dtype=N.int16)
for infile in infiles:
    datafile = open(infile)
    for line in datafile:
        data[i] = line.split(" ")
        i += 1
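
If `wc` is not available (e.g. on Windows), a plain Python line count does the same job; this is just a sketch of the same idea, which I haven't benchmarked against `wc`:

def count_lines(path):
    # Stream through the file once and count the lines
    with open(path) as f:
        return sum(1 for _ in f)

n_lines = sum(count_lines(infile) for infile in infiles)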

The culprit seems to be `numpy.loadtxt`: after removing it, my script no longer needs an extravagant amount of memory, and it even runs 2-3 times faster =)

Learning is a mess

1 Answer


The `loadtxt()` method is not memory efficient because it uses a Python list to temporarily store the file contents. Here is a short explanation of why Python lists take so much space.
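
As a rough illustration of that overhead (my own numbers, and they vary by platform and Python version), compare one million integers stored in a Python list with the same values in a NumPy array:

import sys
import numpy as N

values = list(range(10**6))           # one million Python int objects in a list
arr = N.arange(10**6, dtype=N.int32)  # the same values as a contiguous NumPy array

print(sys.getsizeof(values))                  # the list object itself (just pointers), a few MB
print(sum(sys.getsizeof(v) for v in values))  # plus roughly 24-28 bytes per int object
print(arr.nbytes)                             # 4000000 bytes of raw int32 data

The temporary list that `loadtxt()` builds pays that kind of per-object cost for every value in the file, which helps explain why the peak memory during loading is so much larger than the final array.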

One solution is to create your own implementation for reading text files, as below:

buffsize = 10000  # increase this for large files
ncols = 2         # number of columns, two in this case
data = N.empty((buffsize, ncols), dtype=N.int32)  # initial array of buffsize rows
dataFile = open(infile)

for count, line in enumerate(dataFile):
    if count >= len(data):
        # Grow the array by another buffer's worth of rows
        data.resize((count + buffsize, ncols), refcheck=False)
    line_values = [int(v) for v in line.split()]  # convert the line into values
    data[count] = line_values

# Trim the array to the number of lines actually read
data.resize((count + 1, ncols), refcheck=False)
dataFile.close()

Since we can't always get the line count in advance, I use a buffer to avoid resizing the array on every line.
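
For example, the same loop could be wrapped in a small helper; the function name, the explicit integer parsing and the default arguments below are my own choices, not part of the recipe above:

import numpy as N

def load_columns(path, ncols=2, buffsize=10000, dtype=N.int32):
    # Read a whitespace-separated text file into a 2-D array,
    # growing the array in chunks of buffsize rows.
    data = N.empty((buffsize, ncols), dtype=dtype)
    count = -1  # stays -1 if the file is empty
    with open(path) as f:
        for count, line in enumerate(f):
            if count >= len(data):
                data.resize((count + buffsize, ncols), refcheck=False)
            data[count] = [int(v) for v in line.split()]
    data.resize((count + 1, ncols), refcheck=False)  # trim to the lines actually read
    return data

data = load_columns(infile)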

Note: at first I came up with a solution using `numpy.append`. But as pointed out in the comments, `append` is also inefficient, since it makes a copy of the array contents.

igortg
  • If this is indeed the problem, one solution would be to read and parse the file in chunks, and then concatenate the numpy arrays. But if there is a more memory-efficient implementation available, that would be preferable; and I would not be surprised. OTOH, perhaps you need to take a step back when looking for efficiency in text files in the first place... – Eelco Hoogendoorn Jul 11 '14 at 16:20
  • 2
    I don't know much about numpy, but doesn't `append` always create a copy with the argument added? If so, this code is probably of little use: It takes O(n²) time instead of O(n) time (a serious problem for n = 8e6), and probably still doubles memory consumption from all the temporaries. –  Jul 11 '14 at 16:21
  • You are right @delnan. I use the `resize` function in my internal implementation, but I thought that using `np.append` would be a simpler solution. – igortg Jul 11 '14 at 16:48
  • @itghisi: Thank you very much, `numpy.loadtxt` is indeed the culprit. Inspired by what you proposed, I coded an alternative, which I'll share above (in short, I preferred counting the lines rather than using a buffer, resizing, and leaving empty rows at the end). – Learning is a mess Jul 11 '14 at 17:04
  • Nice @Learningisamess. I think there isn't a platform-agnostic solution to count lines, so I'll leave the buffering/resizing answer for now. Glad that I could help. – igortg Jul 11 '14 at 17:39