In the past, I've tried to use scikit-learn's IncrementalPCA to reduce memory usage. I used this answer as a template for my code, but as @aarslan noted in its comment section: "I've noticed that the explained variance seems to decrease at every iteration." I've always suspected the last for loop in that answer. So my question is: do I need a for loop to keep memory usage constant during the partial_fit step, or is batch_size alone enough? The code is below:
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']   # it's OK, the dataset is not fetched into memory yet
n = data.shape[0]   # how many rows we have in the dataset
chunk_size = 1000   # how many rows we feed to IPCA at a time; a divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
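For reference, the batch_size-only alternative I have in mind is sketched below. I'm not sure whether fit would pull the whole h5py dataset into memory before slicing it into batches, which is essentially what I'm asking:

import h5py
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']  # still an on-disk h5py dataset at this point

# No explicit loop: let IncrementalPCA split the data into batches itself.
# Whether this keeps memory constant, or loads all of `data` up front,
# is exactly the part I'm unsure about.
ipca = IncrementalPCA(n_components=10, batch_size=16)
ipca.fit(data)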