
I am loading a CSV file via numpy.loadtxt into a numpy array. My data has about 1 million records and 87 columns. While object.nbytes is only 177159666 bytes, it actually takes much more memory: I get a 'MemoryError' while training a decision tree with scikit-learn, and after reading the data the available memory on my system drops by 1.8 GB. I am working on a Linux machine with 3 GB of memory. So does object.nbytes return the real memory usage of a numpy array?

import numpy as np

train = np.loadtxt('~/Py_train.csv', delimiter=',', skiprows=1, dtype='float16')
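
A rough way to sanity-check this on Linux is to compare train.nbytes with the process's peak resident memory right after the load (a sketch using the standard-library resource module):

import resource

# On Linux, ru_maxrss is reported in kilobytes
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('train.nbytes   : %d' % train.nbytes)
print('peak RSS, bytes: %d' % (peak_kb * 1024))
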
ibictts
  • So, is there a question that you have? – Marcin Aug 02 '12 at 15:01
  • Here's a related question: http://stackoverflow.com/questions/11527964/convert-a-string-list-to-float32-efficiently . Basically, np.loadtxt takes up LOTS of memory because it first stores the data in lists and then converts those to an ndarray (increasing memory usage by a factor of 3 or 4 at least). If you know the size, you might want to consider pre-allocating the array and parsing it yourself (see the sketch after these comments). Also, don't be afraid to look at the source for np.loadtxt. It's reasonably comprehensible. – mgilson Aug 02 '12 at 15:03
  • @Marcin, just updated my question. – ibictts Aug 02 '12 at 15:05
  • Thanks, @mgilson. Now I can understand the large peak memory usage. Do you find the nbytes attribute for ndarray accurate for estimating its memory usage? – ibictts Aug 02 '12 at 15:20
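
For reference, a minimal sketch of the pre-allocation idea from the comments (the row and column counts are assumptions; adjust them to your file):

import csv
import os
import numpy as np

n_rows, n_cols = 1000000, 87                           # assumed shape; count rows beforehand if unknown
train = np.empty((n_rows, n_cols), dtype='float16')    # allocate the final array once

with open(os.path.expanduser('~/Py_train.csv')) as f:
    reader = csv.reader(f)
    next(reader)                                       # skip the header row
    for i, row in enumerate(reader):
        train[i] = [float(x) for x in row]             # convert each row in place

This avoids the intermediate list of lists that np.loadtxt builds, so peak memory stays close to the size of the final array.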

3 Answers


I had a similar problem when trying to create a large 400,000 x 100,000 matrix. Fitting all of that data into an ndarray is impossible.

However, the big insight I came up with was that most of the values in the matrix are empty, so the data can be represented as a sparse matrix. Sparse matrices are useful because they can represent the same data using far less memory. I used scipy.sparse's sparse matrix implementation, and I'm able to fit this large matrix in memory.

Here is my implementation:

https://github.com/paolodm/Kaggle/blob/master/mdschallenge/buildmatrix.py
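
For illustration, here is a minimal sketch of the scipy.sparse approach (the entries below are placeholders, not the actual data):

import numpy as np
from scipy import sparse

# Only the non-zero entries are stored, so a mostly-empty
# 400,000 x 100,000 matrix fits comfortably in memory.
rows = np.array([0, 3, 12])
cols = np.array([7, 0, 99])
vals = np.array([1.0, 2.5, 3.0])

m = sparse.coo_matrix((vals, (rows, cols)), shape=(400000, 100000)).tocsr()
print(m.data.nbytes + m.indices.nbytes + m.indptr.nbytes)   # bytes actually allocated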

Paolo del Mundo

You will probably get better performance by using numpy.fromiter:

In [30]: numpy.fromiter((tuple(row) for row in csv.reader(open('/tmp/data.csv'))), dtype='i4,i4,i4')
Out[30]: 
array([(1, 2, 3), (4, 5, 6)], dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])

where

$ cat /tmp/data.csv 
1,2,3
4,5,6

Alternatively, I strongly suggest you use pandas: it's built on top of numpy and has many utility functions for statistical analysis.
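
For example, a minimal sketch for the file from the question (the float32 dtype is just an assumption to keep memory down):

import os
import pandas as pd

df = pd.read_csv(os.path.expanduser('~/Py_train.csv'), dtype='float32')
train = df.values                   # the underlying numpy array
print(train.shape, train.nbytes)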

lbolla

I just had the same problem:

My saved .npy file is 752 MB on disk and arr.nbytes = 701289568 (~669 MB), but np.load takes 2.7 GB of memory, i.e. about 4x the actual memory needed.

https://github.com/numpy/numpy/issues/17461

and it turns out:

the data array contains a mix of a small number of strings and a large number of numbers.

But each of those 8-byte slots holds a pointer to a Python object, and that object takes at least 24 bytes plus space for either the number or the string.

So, in memory, an 8-byte pointer plus a 24-byte object comes to roughly 4x the size of the mostly 8-byte double values stored in the file.

NOTE: np.save() and np.load() are not symmetric:

-- np.save() stores the numeric values as plain scalar data, so the on-disk file size matches the data size the user has in mind and stays small;

-- np.load() brings those values back as PyObjects (one per element), inflating memory usage to roughly 4x what the user expects.

The same applies to other file formats, e.g. csv files.

Conclusion: do not mix types (strings as np.object and numeric values) in a numpy array. Use a homogeneous numeric type, e.g. np.double; then the in-memory array takes about the same space as the dumped file on disk.
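
A small sketch illustrating why nbytes undercounts for object arrays (sizes are approximate, CPython on a 64-bit machine):

import sys
import numpy as np

n = 1000000
homogeneous = np.arange(n, dtype=np.double)   # 8 bytes per element, stored inline
boxed = homogeneous.astype(object)            # 8-byte pointer per element

print(homogeneous.nbytes)         # ~8 MB of real storage
print(boxed.nbytes)               # also ~8 MB, but this counts only the pointers
print(sys.getsizeof(boxed[0]))    # each boxed float adds roughly another 24 bytes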

user873275