IncrementalPCA & partial_fit - number of components

Question

I am working in Python with about 4000 images of watches (examples: watch_1, watch_2). The images are RGB and their resolution is 450x450. My aim is to find the most similar watches among them. For this reason I am using IncrementalPCA and partial_fit from scikit-learn to handle this much data within my 26 GB of RAM (see also: SO_Link_1, SO_Link_2). My source code is the following:

import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import IncrementalPCA
from sklearn import neighbors
from sklearn import preprocessing


data = []

# Read images from file #
for filename in glob('Watches/*.jpg'):

    img = cv2.imread(filename)
    height, width = img.shape[:2]
    img = np.array(img)

    # Check that all my images are of the same resolution
    if height == 450 and width == 450:

        # Flatten each image into a single 1-D row vector
        img = img.reshape(-1)
        data.append(img)

# Normalise data #
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

# IncrementalPCA model #
ipca = IncrementalPCA(n_components=6)

length = len(data)
chunk_size = 4
pca_data = np.zeros(shape=(length, ipca.n_components))

for i in range(0, length // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    pca_data[i * chunk_size: (i + 1) * chunk_size] = ipca.transform(data[i*chunk_size : (i+1)*chunk_size])

# K-Nearest neighbours #
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)
print(indices)

However, when I run this program on just 40 images of watches to start with, I get the following error when i = 1:

ValueError: Number of input features has changed from 4 to 6 between calls to partial_fit! Try setting n_components to a fixed value.

However, I clearly set n_components to 6 with ipca = IncrementalPCA(n_components=6), yet for some reason ipca treats chunk_size = 4 as the number of components when i = 0 and then changes it to 6 when i = 1.

Why is this happening?

How can I fix it?

sascha · Accepted Answer · 2018-02-22T15:22:08.393

2

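This seems to follow from the math behind PCA: a batch with n_samples rows can yield at most n_samples components, so the problem is ill-conditioned for n_components > n_samples. Here each chunk has only 4 samples, which is fewer than the 6 components requested.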

You might be interested in reading this (introduction of error-message) and some discussion behind it.

Try increasing the batch size / chunk size (or lowering n_components) so that every chunk contains at least n_components samples.
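A minimal sketch of that fix, assuming a small synthetic array in place of the flattened watch images and an arbitrarily chosen chunk_size of 10. It also splits fitting and transforming into two passes, so that every row is projected with the final model rather than a partially fitted one; that split is a suggestion on top of the advice above, not something the original code did.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in for 40 flattened images (an assumption for illustration;
# real rows would hold 450 * 450 * 3 values each).
rng = np.random.default_rng(0)
data = rng.random((40, 300))

n_components = 6
chunk_size = 10  # assumed value; anything >= n_components avoids the error

ipca = IncrementalPCA(n_components=n_components)

# Pass 1: fit incrementally, skipping a trailing chunk too small to fit on.
for start in range(0, len(data), chunk_size):
    chunk = data[start:start + chunk_size]
    if len(chunk) >= n_components:
        ipca.partial_fit(chunk)

# Pass 2: project every row, including any skipped trailing chunk,
# using the fully fitted model.
pca_data = np.vstack([
    ipca.transform(data[start:start + chunk_size])
    for start in range(0, len(data), chunk_size)
])

print(pca_data.shape)  # (40, 6)
```

The nearest-neighbour search would then presumably be fitted on pca_data rather than on the raw data array.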

(In general I'm also somewhat sceptical about this approach. I hope you have tested it on a small example dataset using batch PCA. Your watches also do not seem to be preprocessed with regard to geometry, e.g. cropping, and maybe histogram/color normalization.)
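One way such a small-scale check could look, as a sketch only: fit ordinary (batch) PCA and IncrementalPCA on the same small array and compare the explained variance ratios they report. The synthetic array, its shape, and batch_size=10 are assumptions for illustration; with real images you would use a subset that fits comfortably in memory.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Small synthetic dataset standing in for a subset of flattened images (assumption).
rng = np.random.default_rng(0)
small = rng.random((40, 50))

# Batch PCA sees all rows at once; IncrementalPCA sees them 10 rows at a time.
batch = PCA(n_components=6).fit(small)
inc = IncrementalPCA(n_components=6, batch_size=10).fit(small)

# Largest absolute difference between the two sets of ratios.
gap = np.abs(batch.explained_variance_ratio_ - inc.explained_variance_ratio_).max()

print(batch.explained_variance_ratio_)
print(inc.explained_variance_ratio_)
print(gap)
```

A large gap on a small subset would be a hint that the incremental fit is not a good stand-in for the batch result on this data.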

edited Feb 22 '18 at 15:22

answered Feb 22 '18 at 14:39


Thank you for your response. Yes, when processing more images of watches I can have `n_components < n_samples`, but with 40 images that was not so reasonable. I may be sceptical about this approach too, but since I do not have a GPU I cannot figure out how to do this differently. Hmm, I do not get exactly what you mean by batch PCA, but if you mean applying PCA in batches by myself, then I have not done that yet because I thought that `IncrementalPCA` is virtually the same thing. – Outcast Feb 22 '18 at 15:31
Concerning preprocessing, cropping is a good idea, but there are also larger watches, so I won't crop the images too much. Also, I am normalising the data, but you probably mean something different by hist-/color-normalisation... – Outcast Feb 22 '18 at 15:34
I would just be worried about PCA's low model complexity with regard to the different kinds of watch geometries. (Hist-eq would be [that](https://en.wikipedia.org/wiki/Histogram_equalization), common in preprocessing; but it seems you are using colors.) It surely is not an easy task, especially without a supervised-style training dataset. But that is not my area of expertise. Maybe you want some feature extractors such as HOG and co. at some step of your pipeline. Hard to say. – sascha Feb 22 '18 at 15:38
Yes, it is quite difficult as a task without a GPU. My next move was to try some feature extractors as you say. In any case, thanks for now! – Outcast Feb 22 '18 at 15:44
