
In the past, I've tried to use scikit-learn's IncrementalPCA to reduce memory usage. I used this answer as a template for my code, but as @aarslan said in the comment section: "I've noticed that the explained variance seems to decrease at every iteration." I've always suspected the last for loop in the given answer. So my question is: do I need a for loop to keep memory usage constant during the partial_fit step, or is batch_size alone enough? Below you can find the code:

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time, the divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

1 Answer


An old question, but yes, the for-loop is needed. The batch_size= parameter is only used with the .fit() method, not with .partial_fit().

Scikit-learn documentation:

batch_size : int, default=None

The number of samples to use for each batch. Only used when calling fit.
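So if you want the rows to be read chunk by chunk straight from the HDF5 file, you have to drive the batching yourself with partial_fit; as far as I can tell, fit() validates the whole input as one in-memory array before splitting it into batch_size pieces, which defeats the purpose of streaming from disk. A minimal sketch of the explicit loop (assuming a dataset named data inside rand-1Mx1K.h5, as in the question, and a chunk size that may not divide the row count evenly):

import h5py
from sklearn.decomposition import IncrementalPCA

with h5py.File('rand-1Mx1K.h5', 'r') as h5:
    data = h5['data']       # h5py Dataset: rows are only read when sliced
    n = data.shape[0]
    chunk_size = 1000       # rows pulled into memory per iteration

    ipca = IncrementalPCA(n_components=10)  # batch_size is ignored by partial_fit
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        if stop - start < ipca.n_components:
            # partial_fit needs at least n_components rows per call,
            # so skip a too-small final remainder
            break
        # slicing the h5py Dataset loads only this chunk into memory
        ipca.partial_fit(data[start:stop])

Each iteration only ever holds one chunk_size slice of the file in memory, which is exactly what the batch_size parameter cannot give you with partial_fit.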
