I am trying to find the first 200 principal components of a dataset of 846 images (2048x2048x3 RGB) with sklearn.decomposition.IncrementalPCA.
The data are read with cv2 and reshaped into a 2D NumPy array of shape (846, 2048*2048*3), dtype float16.
To reduce the memory cost, I use partial_fit() and split the original data into smaller chunks (batches) in both the partial_fit() and transform() steps, following the approach in this answer: Python PCA on Matrix too large to fit into memory
The code works well for relatively small computations, e.g. computing 20 components for 200 of the images, and produces correct output.
However, the task demands 200 components, which imposes the constraint that each batch must contain at least 200 samples (according to sklearn's documentation and the messages in the terminal when running the code): https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
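In other words, my reading of the constraint is that every batch fed to partial_fit(), including the final tail batch, must satisfy something like:

assert chunk_size >= COM_NUM  # each batch needs >= n_components samples, even the tail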
With such a big chunk size I can construct the IPCA model, but I always hit a MemoryError during partial_fit().
What's more, there is a second problem: I need to use inverse_transform later, and I am not sure whether I can use chunk-style computation in that step too. (In the code below I did not.)
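What I imagine is a sketch like the one below (inverse_transform_chunks is a hypothetical helper of mine, not part of my current code; I think this is valid if inverse_transform is the row-wise affine map Z @ pca.components_ + pca.mean_, which I believe holds when whiten=False):

def inverse_transform_chunks(pca, Z, chunk_size=50):
    # yield the reconstruction chunk by chunk so the full float64
    # reconstruction never has to live in memory all at once
    for start in range(0, Z.shape[0], chunk_size):
        yield pca.inverse_transform(Z[start:start + chunk_size])

Each yielded chunk could then be reshaped and written to disk right away instead of concatenating everything first.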
What can I do to avoid this MemoryError? Or should I replace IncrementalPCA with some other method? (Any alternative would need to provide something like inverse_transform().)
The total memory I can access is 131661572 kB (~125 GiB).
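For reference, my rough estimate of what a single partial_fit() batch costs (that the float16 batch is upcast to float64 and that the internal SVD runs on roughly (n_components + batch_size + 1) stacked rows is my reading of sklearn's source, so these numbers may be off):

n_features = 2048 * 2048 * 3  # 50331648 values per flattened image
batch_copy_gb = 220 * n_features * 8 / 1e9             # ~89 GB for the float64 upcast
svd_input_gb = (200 + 220 + 1) * n_features * 8 / 1e9  # ~170 GB for the stacked SVD input
print(batch_copy_gb, svd_input_gb)  # the latter already exceeds my ~125 GiB budget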
My code:
from sklearn.decomposition import PCA, IncrementalPCA
import numpy as np
import cv2
import os
folder_path = "./output_img"
images = []  # renamed from "input" to avoid shadowing the builtin
for i in range(1, 847):
    if i % 10 == 0: print("loading", i, "th image")
    # if i == 60: continue  # special case, should be skipped
    image_path = folder_path + f"/{i}neutral.jpg"
    img = cv2.imread(image_path)
    images.append(img.reshape(-1))  # flatten each image into one row
print("Loaded all", i, "images")
# stack rows into a single numpy matrix
all_image = np.stack(images, axis=0)
# cast to float16 to halve the memory footprint (values stay in 0-255)
all_image = all_image.astype(np.float16)
### shape: #_of_images x pixels_per_image (846 x 50331648 here)
# print(all_image)
# print(all_image.shape)
# PCA, keep 200 components
COM_NUM = 200
pca = IncrementalPCA(n_components=COM_NUM)
print("finished IPCA model set")
saving_path = "./principle847"
element_num = all_image.shape[0]  # how many elements (rows) in the dataset
chunk_size = 220                  # how many rows we feed to IPCA at a time
for i in range(0, element_num // chunk_size):
    pca.partial_fit(all_image[i * chunk_size : (i + 1) * chunk_size])
    print("finished PCA fit:", i * chunk_size, "to", (i + 1) * chunk_size)
pca.partial_fit(all_image[(i + 1) * chunk_size : element_num])  # tail batch
print("finished PCA fit:", (i + 1) * chunk_size, "to", element_num)
chunks = []
for i in range(0, element_num // chunk_size):
    chunks.append(pca.transform(all_image[i * chunk_size : (i + 1) * chunk_size]))
    print("finished PCA transform:", i * chunk_size, "to", (i + 1) * chunk_size)
chunks.append(pca.transform(all_image[(i + 1) * chunk_size : element_num]))  # tail batch
print("finished PCA transform:", (i + 1) * chunk_size, "to", element_num)
result = np.concatenate(chunks, axis=0)  # concatenate once instead of in every iteration
result = pca.inverse_transform(result)  # done in one shot here, not chunked
print("PCA mean:", pca.mean_)
mean_img = pca.mean_.reshape(2048, 2048, 3).astype(np.uint8)
cv2.imwrite(os.path.join(saving_path, "mean.png"), mean_img)
result = result.reshape(-1, 2048, 2048, 3)
# result shape: #_of_images x 2048 x 2048 x 3
dst = result
# dst = result / np.linalg.norm(result, axis=3, keepdims=True)
for j in range(0, COM_NUM):
    reconImage = dst[j]
    # reconImage = reconImage.reshape(4096, 4096, 3)
    reconImage = np.clip(reconImage, 0, 255)
    reconImage = reconImage.astype(np.uint8)
    cv2.imwrite(os.path.join(saving_path, "p" + str(j) + ".png"), reconImage)
    print("Saved", j + 1, "principal imgs")
The error goes like:
File "model_generate.py", line 36, in <module>
pca.partial_fit(all_image[i*chunk_size : (i+1)*chunk_size])
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/decomposition/_incremental_pca.py", line 299, in partial_fit
U, V = svd_flip(U, V, u_based_decision=False)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/sklearn/utils/extmath.py", line 538, in svd_flip
max_abs_rows = np.argmax(np.abs(v), axis=1)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 1103, in argmax
return _wrapfunc(a, 'argmax', axis=axis, out=out)
File "/root/anaconda3/envs/PCA/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
MemoryError