
I've been testing how well PCA and LDA work for classifying three different types of image tags I want to identify automatically. In my code, X is my data matrix, where each row contains the pixels of one image, and y is a 1D array giving the class of each row.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# sklearn.lda was removed from scikit-learn; LDA now lives in discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# project the data onto the first two principal components (unsupervised)
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

plt.figure(figsize=(35, 20))
plt.scatter(X_r[:, 0], X_r[:, 1], c=y, s=200)

# project the data onto the two LDA components (supervised, uses the labels y)
lda = LDA(n_components=2)
X_lda = lda.fit(X, y).transform(X)
plt.figure(figsize=(35, 20))
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, s=200)

With LDA, I end up with three clearly distinguishable clusters with only slight overlap between them. Now, if I have a new image I want to classify, how do I predict which cluster it should fall into once I've turned it into a 1D array, and if it falls too far from a cluster centre, how can I say that the classification is "inconclusive"? I was also curious what the .transform(X) call actually did to my data once I had fit the model.


1 Answer


After you have trained your LDA model on some data X, you may want to project some other data, Z. In that case, what you should do is:

lda = LDA(n_components=2)      # create an LDA object
lda = lda.fit(X, y)            # learn the projection matrix
X_lda = lda.transform(X)       # use the model to project X
# .... getting Z as test data....
Z_lda = lda.transform(Z)       # use the model to project Z
z_labels = lda.predict(Z)      # predicted label for each sample; note that predict expects the original features, not the projection
z_prob = lda.predict_proba(Z)  # the probability of each sample belonging to each class

Note that 'fit' is used for fitting the model, not fitting the data.

So transform is used to build the representation (the projection, in this case), and predict is used to predict the label of each sample. (This pattern is shared by all classes that inherit from BaseEstimator in sklearn.)
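
One way to get the "inconclusive" behaviour asked about in the question is to threshold predict_proba. A minimal sketch (the 0.8 cutoff and the -1 sentinel are arbitrary illustration choices, not anything built into sklearn):

import numpy as np

probs = lda.predict_proba(Z)              # one row of class probabilities per sample
labels = lda.predict(Z)                   # most likely class for each sample
confident = probs.max(axis=1) >= 0.8      # hypothetical cutoff; tune it on validation data
labels = np.where(confident, labels, -1)  # -1 marks an "inconclusive" prediction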

You can read the documentation for further options and properties.

Also, sklearn's API allows you to write pca.fit_transform(X) instead of pca.fit(X).transform(X). Use this version when you are not interested in the model itself after that point in the code.
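
For example, the PCA step from the question becomes a single call (same result, one line, using the PCA imported above):

X_r = PCA(n_components=2).fit_transform(X)  # equivalent to pca.fit(X).transform(X)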

A few comments: since PCA is an unsupervised approach, LDA is better suited to the "visual" classification you are currently doing, because it can make use of the class labels.

Moreover, if you are interested in classification, you may consider using a different type of classifier, not necessarily LDA, although it is a great approach for visualization; one such combination is sketched below.
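
The LDA-projection-plus-SVM combination mentioned in the comments below could look like this (X_train/X_test and y_train/y_test stand for a hypothetical train/test split, and the SVC settings are defaults, not tuned values):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.svm import SVC

lda = LDA(n_components=2).fit(X_train, y_train)      # learn the projection from the training data
clf = SVC().fit(lda.transform(X_train), y_train)     # train the SVM on the projected features
accuracy = clf.score(lda.transform(X_test), y_test)  # evaluate on held-out data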

  • Thank you so much for the help! The probability prediction is fantastic. You're right about PCA and LDA too: when I plotted with both approaches, PCA didn't really cluster that well, but the LDA method produced 3 beautiful clusters with barely any overlap :) – Jack Simpson Jun 29 '15 at 06:50
  • @JackSimpson No prob :) I'm glad it helped you – AvidLearner Jun 29 '15 at 06:53
  • I ran the code and got the following error: "ValueError: X has 2 features per sample; expecting 256" (that's how many columns I have). So I removed the "Z = lda.transform(X)" and performed label and probability prediction on the raw data itself without transforming it, and it seems to work. I got back a list of my predictions (1, 2 or 3) for each class, and for the probabilities I got an array with 3 values for each row, which must mean the probability of it belonging to each of the 3 tag types. It looked like this: [ 9.81963930e-01 1.80360699e-02 4.09434909e-14] – Jack Simpson Jun 30 '15 at 02:41
  • Do those probabilities look reasonable? They all look rather small, and I'm looking for the largest value, right? – Jack Simpson Jun 30 '15 at 02:41
  • It seems to be getting a 93% correct prediction rate though :) – Jack Simpson Jun 30 '15 at 03:55
  • @JackSimpson These probabilities do make sense: type 9.81963930e-01= into Google and see what you get. Also, I had a typo; corrected it. – AvidLearner Jun 30 '15 at 06:28
  • Doh! Massive brain fart; I got so used to associating the minus signs with small numbers that I didn't check what came after. Thanks for pulling me up. By the way, I used your transform step with my test dataset and used an SVM, and the accuracy went up to 94.5%! – Jack Simpson Jun 30 '15 at 06:30
  • @JackSimpson How much did you get before (using an SVM without LDA)? – AvidLearner Jun 30 '15 at 10:50
  • 41%, which is only slightly better than chance! – Jack Simpson Jul 02 '15 at 03:20
  • I had the same issue as @JackSimpson. Any idea why that happens? – David Hagan Dec 08 '15 at 02:33
  • @omerbp Do you know how predict_proba works? What it spits out is an ndarray, but how do we know which class each predicted probability belongs to? – sushmit Feb 10 '17 at 05:16
  • Please update the link for the documentation; it is no longer available. Cheers! – Catalina Chircu Sep 19 '19 at 11:36