5

I have a huge dataset on which I wish to run PCA. I am limited by RAM and by the computational efficiency of PCA, so I switched to using IncrementalPCA.

Dataset size: (140000, 3504)

The documentation states that "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."

This sounds like exactly what I need, but I am unsure how to take advantage of it.

I tried loading a single memmap, hoping IncrementalPCA would access it in chunks, but my RAM blew up. The code below ends up using a lot of RAM:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

When I say "my RAM blew", the Traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

How can I improve this without compromising accuracy by reducing the batch size?


My ideas to diagnose:

I looked at the sklearn source code, and in the fit() function (source code) I can see the following. This makes sense to me, but I am still unsure about what is wrong in my case.

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self

Edit: Worst case scenario, I will have to write my own code for incremental PCA that batch-processes the data by reading and closing .npy files. But that would defeat the purpose of taking advantage of the facility that is already there.
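
Roughly what I have in mind for that worst case - an untested sketch, where the chunk file names, their count and the batch size are just placeholders:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_chunks = 140  # e.g. 140 pre-saved files of 1000 rows each
ipca = IncrementalPCA(n_components=50)
for i in range(n_chunks):
    chunk = np.load('chunk_%d.npy' % i)          # one (1000, 3504) block at a time
    ipca.partial_fit(chunk.astype(np.float64))   # only this block is ever held in RAM
    del chunk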

Edit 2: If I could somehow delete a batch of the memmap file once it has been processed, it would make much more sense.

Edit 3: Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure I am not making a mistake in flushing the memmap completely to disk beforehand.

import numpy as np
import pywt
from sklearn.decomposition import IncrementalPCA

temp_train_data = X_train[1000:]
temp_labels = y[1000:]
out = np.empty((200001, 3504), np.int64)
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    data = X_train[actual_index-1000:actual_index+1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i
out.flush()
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I note that out.flush doesn't free my memory. Is there a way to use del out to free the memory completely, and then pass just a pointer to the file to IncrementalPCA.fit()?
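
For illustration, this is the kind of thing I am hoping is possible (untested sketch; the file name is a placeholder, and I am not sure reopening in read-only mode actually avoids the copy inside fit()):

# write the pre-processed data to a memmap instead of a plain np.empty array
out = np.memmap('my_processed_array.mmap', dtype=np.float16, mode='w+',
                shape=(200001, 3504))
# ... fill `out` row by row as in the loop above ...
out.flush()   # push the dirty pages to disk
del out       # drop the Python reference so the pages can be released

# reopen the same file read-only and hand the lazily-loaded map to the estimator
out = np.memmap('my_processed_array.mmap', dtype=np.float16, mode='r',
                shape=(200001, 3504))
clf = IncrementalPCA().fit(out)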

asked by Abhishek Bhatia · edited by J Richard Snape
  • I believe [this answer](http://stackoverflow.com/a/16633274/832621) will give you some hints – Saullo G. P. Castro Aug 27 '15 at 09:44
  • 1) Why don't you specify the desired number of features after the transformation? After the transformation you will probably get a dataset of the same size, but held in RAM (because it is generated by transform), so it consumes a large amount of memory. Specify the n_components parameter in the constructor. 2) Even if you specify n_components (which must be less than or equal to the number of features in the dataset), it may still not fit into memory, because you are trying to compute the transformed dataset in one go. You may need to transform it in batches, calling the transform method on every batch and saving the transformed data to the HDD. – Ibraim Ganiev Aug 27 '15 at 12:12
  • @Olologin Thanks for the comment! I tried just `clf = IncrementalPCA().fit(X_train_mmap)` and it crashes my RAM. I am keen to retain 98% of the variance. – Abhishek Bhatia Aug 27 '15 at 16:19
  • @AbhishekBhatia, try using IncrementalPCA(n_components=1).fit(X_train_mmap). Does it complete successfully? – Ibraim Ganiev Aug 27 '15 at 17:19
  • @Olologin I'm trying that. Can you please check Edit 3 in the question? – Abhishek Bhatia Aug 27 '15 at 17:38
  • @Olologin Still crashes. – Abhishek Bhatia Aug 28 '15 at 06:28
  • @AbhishekBhatia, at which point inside partial_fit() does it crash? Does it crash with a memory exception? – Ibraim Ganiev Aug 28 '15 at 06:48
  • @Olologin It exceeds my computer's RAM capacity and freezes it. Please check this too. http://chat.stackoverflow.com/transcript/6?m=25348558#25348558 – Abhishek Bhatia Aug 28 '15 at 10:12
  • You are missing some really crucial information in the question - making it hard for @Olologin to help. Firstly - you say "blew my RAM". You should put the full traceback. It would also be helpful to say that the `MemoryError` occurs instantly on the call to `fit` - not after some heavy processing. I know you are trying to help by saving space and showing you have thought about where the problem might be, but always put the traceback. Can you check the Traceback I have edited into your question matches what you see? – J Richard Snape Aug 28 '15 at 11:23
  • Also - in the code you post in Edit 3 - you aren't using a `memmap` - `np.empty` is a normal array. Could you edit it to fit the rest of the question? – J Richard Snape Aug 28 '15 at 14:34

2 Answers

7

You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need that to allow you to create the memmap object without numpy throwing errors.

In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python together with 64-bit numpy, scipy and scikit-learn and you are good to go.

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that, within the library, if your dtype is float16, a copy of the whole array into memory is triggered. The details of this are below.

So, I hope you have access to a 64 bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch process it, a rather larger task...
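
To give an idea of what that manual batching could look like - a minimal, untested sketch (the batch size of 1000, n_components=50 and reopening the file read-only are arbitrary choices on my part). The point is that only one small batch at a time is ever cast to float64 in RAM:

import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.utils import gen_batches

n_samples, n_features = 140000, 3504
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r',
               shape=(n_samples, n_features))

ipca = IncrementalPCA(n_components=50)
for batch in gen_batches(n_samples, 1000):
    # slice the memmap yourself; only this 1000-row block is copied/cast in memory
    ipca.partial_fit(np.asarray(ut[batch], dtype=np.float64))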

N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches loop that you found is never reached.


Detailed diagnosis:

The actual error generated by the OP's code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this on to np.array(), the copy=False is ignored (as per the docs) because the requested dtype differs from the array's dtype; therefore np.array makes a copy.
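
You can reproduce that behaviour in isolation with a tiny array (illustrative sketch; np.float was simply an alias for the builtin float, i.e. float64, in the numpy versions of that era):

import numpy as np

a = np.zeros((4, 3), dtype=np.float16)
b = np.array(a, dtype=np.float64, copy=False)  # same call pattern as check_array

print(b.dtype)                      # float64 - the requested dtype wins
print(b is a)                       # False
print(np.may_share_memory(a, b))    # False - a full copy was made despite copy=False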

This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.

The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native function from LAPACK. Thus, I do not think there is a way to patch this (at least not an easy way - it would at minimum require altering code in core scipy).

answered by J Richard Snape
  • This is a good remark. It's probably better to memory-map data stored with `dtype=np.float64` to begin with in that case. Alternatively, it's also possible to explicitly unroll the fitting by calling `partial_fit` manually with explicitly loaded data on the go and forget about memory mapping. – ogrisel Aug 28 '15 at 16:00
  • @ogrisel yes - you can start with `dtype=np.float64`, but then the OP (if in a 32 bit environment) can't create a `memmap` with that many members. At least I think that's the case, I'm struggling a little to understand the exact use case. I think you're right to suggest chunking up the data 'manually' and passing it to `partial_fit`. – J Richard Snape Aug 28 '15 at 16:07
  • @AbhishekBhatia Does this answer the question? If so, I'd appreciate an accept. I know it probably doesn't enable you to do what you want, but it does explain in detail why it fails. See the linked github issue for any progress. Note that the patches I suggest there will allow you to use `fit()`, but not `fit_transform()`. – J Richard Snape Sep 08 '15 at 10:55
1

Does the following alone trigger the crash?

X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
                         mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)

If not, then you can use that model to transform (project) your data iteratively into a smaller representation, using batches:

from sklearn.utils import gen_batches

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, clf.batch_size_):
    # project one batch at a time and write it straight into the output memmap
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected

I have not tested that code but I hope that you get the idea.

answered by ogrisel
  • Nice idea!! This tends to work better, but why specify the `number of components`? How does that affect the memory? – Abhishek Bhatia Aug 27 '15 at 14:47
  • Is there a way to completely flush a memmap from memory in Python and somehow just keep a pointer? I notice `memmap_object.flush()` and `del memmap_object` have different effects. I want to pass just a pointer to IncrementalPCA if possible. – Abhishek Bhatia Aug 27 '15 at 17:21
  • Unfortunately, even with the smallest possible (probably ridiculous) `n_components=1`, the call to `fit` in `IncrementalPCA` will likely fail when the data array (even when using `memmap`) is this size. See [my answer below](http://stackoverflow.com/a/32269827/838992) for why (maybe OP can verify that setting `n_components=1` - it fails on my machine given the same setup). Hopefully the OP has access to a 64 bit computer with plenty of RAM. – J Richard Snape Aug 28 '15 at 11:34