
I wrote a program intended to classify an image by similarity:

for i in g:
    fulFi = i

    tiva = []
    tivb = []

    a = cv2.imread(i)
    b = cv2.resize(a, (500, 500))

    img2 = flatten_image(b)
    tivb.append(img2)
    cb = np.array(tivb)
    iab = trueArray(cb)

    print "Image:                      " + (str(i)).split("/")[-1]
    print "Image Size                  " + str(len(iab))
    print "Image Data:                 " + str(iab) + "\n"



pca = RandomizedPCA(n_components=2)
X = pca.fit_transform(iab)
Xy = pca.transform(X)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, Xy.ravel())

def aip(img):
    a = cv2.imread(img)
    b = cv2.resize(a, (500, 500))

    tivb = []

    r = flatten_image(b)
    tivb.append(r)
    o = np.array(tivb)
    l = trueArray(o)

    print "Test Image:                 " + (str(img)).split("/")[-1]
    print "Test Image Size             " + str(len(l))
    print "Test Image Data:            " + str(l) + "\n"

    return l


testIm = aip(sys.argv[2])
b = pca.fit_transform(testIm)
print "KNN Prediction:             " + str(knn.predict(b))

And while it ran without crashing, it had a problem: it gave me the exact same value regardless of the image used:

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Test Image:                 agun.jpg
Test Image Size             750000
Test Image Data:            [216 255 253 ..., 205 225 242]

KNN Prediction:             [-255.]

and

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Test Image:                 bliss.jpg
Test Image Size             750000
Test Image Data:            [243 240 232 ...,  13  69  48]

KNN Prediction:             [-255.]

The KNN prediction is always [-255.], no matter the image used. After investigating further, I found that the problem was my PCA: for some reason, it was taking an array with 750,000 values and returning an array with only one:

pca = RandomizedPCA(n_components=2)
X = pca.fit_transform(iab)
Xy = pca.transform(X)

print "Iab:                        " + str(iab)
print "Iab Type:                   " + str(type(iab))
print "Iab length:                 " + str(len(iab))



print "X Type:                     " + str(type(X))
print "X length:                   " + str(len(X))
print "X:                          " + str(X)


print "Xy Type:                    " + str(type(Xy))
print "Xy Length:                  " + str(len(Xy))
print "Xy:                         " + str(Xy)

gives this:

Image:                      150119131035-green-bay-seattle-nfl-1024x576.jpg
Image Size                  750000
Image Data:                 [255 242 242 ..., 148 204 191]

Iab:                        [255 242 242 ..., 148 204 191]
Iab Type:                   <type 'numpy.ndarray'>
Iab length:                 750000
X Type:                     <type 'numpy.ndarray'>
X length:                   1
X:                          [[ 0.]]
Xy Type:                    <type 'numpy.ndarray'>
Xy Length:                  1
Xy:                         [[-255.]]

My question is why? X and Xy should both have hundreds of values, not just one. The tutorial I followed didn't have an explanation, and the documentation only says that transform and fit_transform expect arrays in the same format. How should I be approaching this?

Rich

2 Answers


If n_components=2, RandomizedPCA will only keep a maximum of 2 components (see the documentation here). Try increasing this to allow more components to be selected; this should solve your issue.

Aurora0001
  • I just changed it to 10, and then 12. It's still only 1 both times. – Rich Jul 28 '16 at 16:13
  • Try using a *significantly* bigger number - you have 750,000 features, so 12 is virtually nothing. – Aurora0001 Jul 28 '16 at 16:14
  • I went up to 250 and and X/Xy were *still* at only one value. I went up to 500 and 1000, and I got a `MemoryError`. – Rich Jul 28 '16 at 16:22
  • Also, my computer nearly froze at 250, it wasn't responding. – Rich Jul 28 '16 at 16:24
  • Try getting the `shape` attribute of `X` (`print(X.shape)`). Sometimes numpy arrays don't act as you expect and display their length as something different. – Aurora0001 Jul 28 '16 at 16:37
  • It was exactly one, for all of them: `Iab length: 750000, Iab Shape: (750000,); X length: 1, X Shape: (1, 1), X: [[ 0.]]; Xy Length: 1, Xy Shape: (1, 1); Xy: [[-255.]]` – Rich Jul 28 '16 at 21:00
  • This answer is incorrect. The number of components represents something totally different. – Rahul Murmuria Jul 28 '16 at 22:17

What you are doing with X = pca.fit_transform(iab) and Xy = pca.transform(X) is wrong.

  1. You are losing the iab variable for the two images. You need the flattened arrays of both images available outside of your for loop; as written, each iteration overwrites the iab array from the previous one.
  2. Even if you saved the two arrays separately, as say iab[0] and iab[1], you will need to perform PCA on both and use both images represented along the transformed axes. You need to decide what to use to learn the transformation though.

Here is sample code:

# First initialize the PCA with the desired number of components
pca = RandomizedPCA(n_components=2)

# Next, fit the data to learn the transformation
# (stack the flattened images as rows of one matrix)
pca.fit(np.vstack((iab[0].reshape(1, -1), iab[1].reshape(1, -1))))

# Finally, apply the learned transformation to the input data
# (transform expects 2-D input, hence the reshape)
X = [pca.transform(iab[0].reshape(1, -1)),
     pca.transform(iab[1].reshape(1, -1))]

You basically learn PCA on a matrix whose rows each represent one image. What you want to be doing is identifying which pixels best describe and differentiate the images; for that you need to input many images and find which pixels vary between them more than others. The way you called fit, you passed in a single 1D array of values, which effectively means you had one value representing each image and hundreds of images.
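The rows-as-images idea can be sketched as follows. This is a minimal example with random numbers standing in for the flattened images, and it uses modern scikit-learn's PCA with svd_solver='randomized' (the old RandomizedPCA class was later removed); the sizes here are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA  # replaces RandomizedPCA in newer scikit-learn

# Stand-in for flattened images: 10 "images", 600 pixels each
# (in the real script each row would come from flatten_image/trueArray)
rng = np.random.RandomState(0)
images = rng.randint(0, 256, size=(10, 600)).astype(float)

# Each ROW is one image; PCA reduces 600 pixel-features to 2 components
pca = PCA(n_components=2, svd_solver='randomized', random_state=0)
X = pca.fit_transform(images)

print(X.shape)  # (10, 2): one 2-D point per image, not (1, 1)
```

With one row per image you get one transformed point per image, which is what the classifier downstream needs.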

Also, in your case you combined fit() and transform() into fit_transform(), which is a valid use as long as you understand what it represents. Regardless, you missed transforming the second image.

If you want to know more about how PCA works you can read this answer.

Finally, you cannot learn a KNN classifier from 1 training sample and 1 testing sample! Learning algorithms are meant to learn from a population of inputs.

All you seem to need is basic distance between the two. You need to pick a distance metric. If you choose to use Euclidean distance (also called the L2 norm), then here is the code for it:

dist = numpy.linalg.norm(X[0]-X[1])

You can also do this instead:

from scipy.spatial import distance
dist = distance.euclidean(X[0], X[1])

In any case, there is no meaning in transforming the transformed data again, as you are doing with Xy = pca.transform(X). That doesn't give you a target.

You can only apply classification such as KNN when you have say, 100 images, where 50 show a "tree" and the remaining 50 show a "car". Once you train the model, you can predict if a new image is of a tree or a car.
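To make the tree/car idea concrete, here is a minimal sketch. The class labels, cluster means, and image sizes are all invented stand-ins for real flattened images, and it uses modern scikit-learn's PCA in place of the old RandomizedPCA:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# Invented stand-ins for flattened images: 20 "tree" and 20 "car" images,
# 600 pixels each, drawn from well-separated distributions
trees = rng.normal(loc=60.0, scale=10.0, size=(20, 600))
cars = rng.normal(loc=180.0, scale=10.0, size=(20, 600))
X_raw = np.vstack([trees, cars])
y = np.array(["tree"] * 20 + ["car"] * 20)  # labels come from you, not from PCA

# Fit PCA once, on the whole training matrix (rows = images)
pca = PCA(n_components=5, random_state=0)
X = pca.fit_transform(X_raw)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new image must go through the SAME fitted PCA via transform(),
# never a fresh fit_transform()
new_img = rng.normal(loc=60.0, scale=10.0, size=(1, 600))
print(knn.predict(pca.transform(new_img)))  # a tree-like image
```

The key points are that the labels are supplied by you (PCA does not produce a target), and that test images are projected with the already-fitted transform.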

Rahul Murmuria
  • Don't worry, I have hundreds of images. I was just using those two as an example to see if the script worked. – Rich Jul 29 '16 at 00:04
  • Could you go into some more detail about how your PCA fit actually works? – Rich Jul 29 '16 at 00:07
  • So, what you're saying is: iab has to be a *huge* array with every image as a line? That seems like it would be way, way too big. Every image converts to 750,000 values. – Rich Jul 29 '16 at 00:16
  • @Rich, if you didn't store that way-too-big array, then how do you intend to learn patterns from all the images? Your way, you read images and overwrite each one with the next. – Rahul Murmuria Jul 29 '16 at 11:34
  • I encourage you to read the link I shared about PCA and find other more detailed sources. In short, in your case each image has 750000 features (each being a pixel at a certain position). What you want to do is reduce that dimension to something manageable, like 10 features (you chose n_components=2). What PCA does is finds linear combinations of the 750000 features that differentiate between the photos the most. Suppose 100 pixels in the center are the most differentiating among all photos, it will pick the first component as pixel_#5k + pixel_#5k1 + pixel#5k2... + pixel#5k100. – Rahul Murmuria Jul 29 '16 at 11:44
  • As far as learning systems are concerned, to test it you need to try a bigger set than just 2 images. The idea with PCA and KNN is to learn patterns from a lot of images. Perhaps you can try with 10 or so training images, and then try to predict new images one at a time. Further, KNN is meant to classify the images into one of many groups (such as the tree and car example). If you simply want the distance between the images, then you need a plain distance function looped over all images. You may want to pick an outlier detection algorithm if you have no groups or classes of images. – Rahul Murmuria Jul 29 '16 at 11:48
  • Do select my answer as the right answer to your question – Rahul Murmuria Jul 29 '16 at 11:54
  • I just did. Thanks for all your help. – Rich Jul 29 '16 at 14:42
  • One last question: what about @Aurora0001's answer? Should the number of components actually be 20 or more? – Rich Jul 29 '16 at 14:46
  • When reducing the components from 750000 to say 2, you are losing some information. What PCA does is gives you the most descriptive component first. The more components you have, the more variance you will retain from the original data. After you do `pca.fit()`, you can try `print pca.explained_variance_ratio_` to see how much variance each component in the transformed axes covered. So, if you used n_components as very large, the `sum(pca.explained_variance_ratio_)` will be equal to 1. If you used `n_components =2`, the sum might be smaller. You choose how much variance you wish to retain. – Rahul Murmuria Jul 29 '16 at 17:12