
I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5 file as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for the help.

KrawallKurt

2 Answers


Your program is probably failing because it tries to load the entire dataset into RAM. 4 bytes per float32 × 1,000,000 × 1,000 is about 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.

With h5py datasets we should avoid passing the entire dataset to our methods and pass slices of the dataset instead, one at a time; slicing reads only the requested rows into memory, as sketched below.
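
A minimal illustration of that point (assuming the db.h5 file and data dataset names from your question); slicing an h5py dataset materializes only that slice as a regular NumPy array:

import h5py

db = h5py.File("db.h5", "r")
data = db["data"]     # lazy handle, nothing is read yet
chunk = data[0:1000]  # only these 1000 rows are loaded into memory
print(type(chunk), chunk.shape)  # a plain numpy.ndarray of shape (1000, 1000)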

As I don't have your data, let me start by creating a random dataset of the same size:

import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

It creates a nice 3.8 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we'll get the MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit afterwards.)
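
If you prefer to set the cap from inside Python, a rough equivalent would be the following sketch (my own addition, not part of the ulimit trick above; on Linux, resource.RLIMIT_AS limits the address space rather than resident memory):

import resource

# Cap the process's address space at ~2 GiB, keeping the existing hard limit.
_, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))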

Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time; should be a divisor of n, or the remainder rows are skipped by the loop below
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
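
Once the model is fitted, the same chunking idea can presumably be applied when projecting the data; a sketch (not part of the original answer, reusing the n, chunk_size, data and ipca names from above; the reduced array is just for illustration):

# Project the data chunk by chunk; the 10-component result (~40 MB) fits in RAM.
reduced = np.empty((n, 10), dtype=np.float32)
for i in range(0, n//chunk_size):
    sl = slice(i*chunk_size, (i+1)*chunk_size)
    reduced[sl] = ipca.transform(data[sl])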

sastanin
  • Okay, so basically I should not call fit but partial_fit several times. I didn't see that method, because the tutorial uses fit. Do you know what the batch_size parameter of fit is for, if it loads the entire dataset at once? – KrawallKurt Jul 15 '15 at 17:21
  • The trick of not loading all data in memory is probably taken care of by the h5py library. Its dataset object (`h5['data']`) appears to behave like a regular numpy array, but it is not. `IncrementalPCA` doesn't know that it's an on-disk data structure, and at some point reads all rows (`MemoryError`!). The calculation is still executed in `batch_size` batches. – sastanin Jul 17 '15 at 13:24
  • This happens in `fit()` which [calls check_array()](https://github.com/scikit-learn/scikit-learn/blob/0.16.1/sklearn/decomposition/incremental_pca.py#L165), which is supposed to convert the data to a regular numpy array (https://github.com/scikit-learn/scikit-learn/blob/0.16.1/sklearn/utils/validation.py#L268). Calling `partial_fit()` bypasses this conversion. – sastanin Jul 17 '15 at 13:27
  • @sastanin I've noticed that the explained variance seems to decrease at every iteration. Is that normal? I'd expect it to follow a convex curve and approach 100% at one point. But I'm also not sure if partial fit requires a certain relationship between the batch size and the number of features. – aarslan Dec 02 '15 at 16:17
  • I am also skeptical about the last for loop. I am not sure, but the reason for the constant memory allocation may be the batch_size rather than the for loop. – MehmedB Jan 07 '20 at 17:08
  • @MehmedB The point of the example is not to find a solution, but to show how to process smaller chunks of data. Depending on your data set you may have to take random samples or do more iterations. In this case we're doing PCA on white-noise data. On a big enough data set all components should be equal. This IPCA is not supposed to converge (hence the explained variance decreases). – sastanin Jan 09 '20 at 10:14
  • @sastanin I was a bit confused. In fact, I opened a question about it here: https://stackoverflow.com/questions/59633104/do-you-need-a-for-loop-for-incrementalpca-in-order-to-keep-constant-memory-usage Thank you for the clarification. – MehmedB Jan 09 '20 at 17:32

One can use NumPy's memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs, when it needs it. Since IncrementalPCA processes the data in batches, the memory usage remains under control at any given time. Here is a sample:

from sklearn.decomposition import IncrementalPCA
import numpy as np

# filename, m (rows), n (columns) and n_batches are placeholders for your own data
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))
batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=10, batch_size=batch_size)
inc_pca.fit(X_mm)
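
If the data currently lives in the hdf5 file from the question, a raw memmap file could presumably be produced from it one chunk at a time; a sketch under that assumption (the x.mm filename and the 1000-row chunk size are my choices, not part of the original answer):

import h5py
import numpy as np

with h5py.File("db.h5", "r") as db:
    data = db["data"]
    m, n = data.shape
    # mode="w+" creates the file on disk; rows are copied 1000 at a time.
    mm = np.memmap("x.mm", dtype="float32", mode="w+", shape=(m, n))
    for i in range(0, m, 1000):
        mm[i:i+1000] = data[i:i+1000]
    mm.flush()
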
Areza