
I am trying to find the first 200 principal components of a dataset of 846 images (2048x2048x3 RGB) with sklearn.decomposition.IncrementalPCA.

The data are read with cv2 and reshaped into a 2D NumPy array of shape (846, 2048*2048*3), dtype float16.

To keep the memory cost down, I use partial_fit() and split the data into smaller chunks (batches) in both the partial_fit() and transform() steps, just like the approach in this question's solution: Python PCA on Matrix too large to fit into memory

The code works well for smaller computations, e.g. computing 20 components for 200 images from the dataset, and produces correct results.

However, the task requires 200 components, which imposes the limit that my batch size must be greater than or equal to 200 (according to the sklearn documentation and the messages in the terminal when running the code): https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
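To illustrate the constraint I mean, here is a minimal snippet (the 150x1000 shape is made up just for the example); as far as I can tell, partial_fit() refuses any batch that has fewer samples than n_components:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=200)
small_batch = np.random.rand(150, 1000).astype(np.float32)  # hypothetical batch: 150 samples, 1000 features
ipca.partial_fit(small_batch)  # ValueError: the batch has fewer samples (150) than n_components (200)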

With such a big chunk size, I can create the IncrementalPCA model, but I always hit a MemoryError during partial_fit().

On top of that, I need to use inverse_transform later, and I am not sure whether that step can also be computed chunk by chunk. (In the code below I do not chunk it.)
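For reference, this is roughly what I would try for a chunked inverse_transform (only a sketch; I assume inverse_transform() works row by row, so the chunk results can simply be concatenated):

import numpy as np

def inverse_transform_in_chunks(pca, transformed, chunk_size=100):
    # apply pca.inverse_transform() chunk by chunk and stitch the rows back together
    parts = []
    for start in range(0, transformed.shape[0], chunk_size):
        parts.append(pca.inverse_transform(transformed[start:start + chunk_size]))
    return np.concatenate(parts, axis=0)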

What can I do to avoid this MemoryError? Or should I replace IncrementalPCA with some other method (one that also provides something like inverse_transform())?

The total memory I have access to is 131661572 kB (~127 GB).
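For context, my rough estimate of the sizes involved (assuming sklearn upcasts my float16 chunks to float64 internally before the SVD):

n_images = 846
n_features = 2048 * 2048 * 3                               # 12,582,912 values per image
chunk_size = 220
whole_dataset_gb = n_images * n_features * 2 / 1e9         # ~21 GB for the full float16 matrix
one_chunk_float64_gb = chunk_size * n_features * 8 / 1e9   # ~22 GB if one chunk is upcast to float64
print(whole_dataset_gb, one_chunk_float64_gb)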

My code:

from sklearn.decomposition import PCA, IncrementalPCA
import numpy as np
import cv2
import os

folder_path = "./output_img"
input=[]
for i in range(1, 847):
    if i%10 == 0: print("loading",i,"th image")
    # if i == 60: continue #special case, should be skipped
    image_path = folder_path+f"/{i}neutral.jpg"
    img = cv2.imread(image_path)
    input.append(img.reshape(-1))
print("Loaded all",i,"images")
# change into numpy matrix
all_image = np.stack(input,axis=0)
# convert to float16 to reduce memory usage
all_image = all_image.astype(np.float16)

### shape: #_of_images x image_pixel_num (50331648 for the img_normals case)
# print(all_image)
# print(all_image.shape)

# PCA, keep 200 components
COM_NUM=200
pca=IncrementalPCA(n_components = COM_NUM)
print("finished IPCA model set")

saving_path = "./principle847"

element_num = all_image.shape[0] # how many elements(rows) we have in the dataset
chunk_size = 220 # how many elements we feed to IPCA at a time

for i in range(0, element_num//chunk_size):
    pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
    print("finished PCA fit:",i*chunk_size,"to",(i+1)*chunk_size)
pca.partial_fit(all_image[(i+1)*chunk_size : element_num]) #tail
print("finished PCA fit:",(i+1)*chunk_size,"to",element_num)

for i in range(0, element_num//chunk_size):
    if i==0:
        result =  pca.transform(all_image[i*chunk_size : (i+1)*chunk_size])
    else:
        tmp = pca.transform(all_image[i*chunk_size : (i+1)*chunk_size])
        result = np.concatenate((result, tmp), axis=0)
    print("finished PCA transform:",i*chunk_size,"to",(i+1)*chunk_size)

tmp = pca.transform(all_image[(i+1)*chunk_size : element_num]) #tail
result = np.concatenate((result, tmp), axis=0)
print("finished PCA transform:",(i+1)*chunk_size,"to",element_num)

result = pca.inverse_transform(result)
print("PCA mean:",pca.mean_)

mean_img = pca.mean_
mean_img = mean_img.reshape(2048,2048,3)
mean_img = mean_img.astype(np.uint8)
cv2.imwrite(os.path.join(saving_path,("mean.png")),mean_img)

result=result.reshape(-1,2048,2048,3)
# result shape: #_of_images x 2048 x 2048 x 3
dst = result
# dst=result/np.linalg.norm(result,axis=(3),keepdims=True)
for j in range(0,COM_NUM):
    reconImage = (dst)[j]
    # reconImage = reconImage.reshape(4096,4096,3)
    reconImage = np.clip(reconImage,0,255)
    reconImage = reconImage.astype(np.uint8)
    cv2.imwrite(os.path.join(saving_path,("p"+str(j)+".png")),reconImage)
    print("Saved",j+1,"principle imgs")

The error looks like this:

 File "model_generate.py", line 36, in <module>
    pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
  File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/decomposition/_incremental_pca.py", line 299, in partial_fit
    U, V = svd_flip(U, V, u_based_decision=False)
  File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 538, in svd_flip
    max_abs_rows = np.argmax(np.abs(v), axis=1)
  File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1103, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
MemoryError
  • To clarify: you wish to compute a PCA on the channel components of an image dataset (~12.6 million features)? That would imply a covariance matrix with several billion entries, which you approximate using 800-some samples. Would downsampling the images (and potentially converting to gray) be an option? – FirefoxMetzger Apr 21 '22 at 06:37
  • Thanks for @FirefoxMetzger's reply. However, downsampling these images or converting them to grey is not acceptable for the later steps of this procedure. – capdo Apr 21 '22 at 07:04
  • Mathematically, your PCA cannot have more components than you have inputs, as that would be an under-determined system. You may be able to do a PCA for each RGB channel (reducing the data input by 1/3), then do another PCA of the stacked results of the 3 per-channel PCAs, and then compose the kernels into a single PCA matrix. The implementation of that would be beyond the time I have to commit right now, but it should work mathematically. – Daniel F Apr 21 '22 at 08:10
  • Thanks for @DanielF's suggestion. Actually, I have considered separating the 3 channels and doing PCA per channel, but I am wondering what I should do with the three PCA results. Obviously just superposing the 3 results is not the same as doing the original PCA. Could you explain 'doing another PCA of the stacked results of the 3 PCAs' in more detail? What are its input and output? – capdo Apr 21 '22 at 08:59
  • @DanielF I don't think separating the channels is such a great idea, as the information is typically highly redundant (which is why using grayscale instead of RGB is typically a good idea). – Paul Brodersen Apr 21 '22 at 13:13
  • Is the memory error triggered on the first iteration? If not, try setting `copy=False` when initializing the `IncrementalPCA` object. – Paul Brodersen Apr 21 '22 at 13:21
