3

Using the prcomp function, how can I use unsupervised principal components derived from a dataset when that same dataset is split into training and test sets?

train <- sample(1:nrow(auto), 60000)
x <- as.matrix(auto[, -1])  ## covariates
y <- auto[, 1]              ## response
pc <- prcomp(x)             ## find principal components

data <- data.frame(y = y, (x %*% pc$rotation[, 1:9]))
fit <- glm(y ~ ., data = data[train, ], family = "binomial")  ## train it

prediction <- predict(fit, newdata = data) > 0  ## prediction on the entire data set

error <- mean(y[-train] != prediction[-train])  ## mean out-of-sample error
John Smith
  • Please read the [faq](http://stackoverflow.com/faq) and this question on [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I have no idea what you're getting at; you need to give a more to-the-point description of what you want. And I can't even run your code to check what exactly you're doing... – Joris Meys Jun 04 '12 at 07:32

2 Answers

7

This is a reproducible example:

set.seed(1)
want <- sample(50, 40)
Iris <- iris[c(51:100, 101:150), ] ## only keep versicolor and virginica
## take our training and test sets
train <- droplevels(Iris[c((1:50)[want], (51:100)[want]), , drop = FALSE])
test <- droplevels(Iris[c((1:50)[-want], (51:100)[-want]), , drop = FALSE])

## fit the PCA
pc <- prcomp(train[, 1:4])

Now note that pc$x is the rotated data. You used x %*% pc$rotation (where x is the data matrix in your code) but didn't centre the data first; since prcomp() centres by default, your scores differ from pc$x only by a constant shift in each column, which the intercept in the regression absorbs, so the fitted model is effectively the same. Centring the predictors in the regression may still be useful.
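A quick check (a sketch, reusing pc and train from above) confirms that pc$x is just the centred training data multiplied by the rotation matrix:

## centre the training data with the column means stored by prcomp(), then rotate
centred <- sweep(as.matrix(train[, 1:4]), 2, pc$center)
all.equal(unname(centred %*% pc$rotation), unname(pc$x))  ## should be TRUE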

## create data frame for logistic regression
mydata <- data.frame(Species = train[, "Species"], pc$x)
## ...and fit the model
mod <- glm(Species ~ PC1, data = mydata, family = binomial)

Predict the scores on PC1 for the test set data; that is, rotate the test set using the same rotation used to form the PCs of the training data. For that we can use the predict() method for class "prcomp":

test.p <- predict(pc, newdata = test[, 1:4])

Now use that to predict the class:

pred <- predict(mod, newdata = data.frame(test.p), type = "response")
pred

> pred
         56          66          67          71          72 
0.080427399 0.393133104 0.092661480 0.395813527 0.048277608 
         74          76          82          87          95 
0.226191156 0.333553423 0.003860679 0.617977807 0.029469167 
        106         116         117         121         122 
0.999648054 0.922145431 0.924464339 0.989271655 0.318477762 
        124         126         132         137         145 
0.581235903 0.995224501 0.999770995 0.964825109 0.988121496 
> 1 - pred
          56           66           67           71           72 
0.9195726006 0.6068668957 0.9073385196 0.6041864731 0.9517223918 
          74           76           82           87           95 
0.7738088439 0.6664465767 0.9961393215 0.3820221934 0.9705308332 
         106          116          117          121          122 
0.0003519463 0.0778545688 0.0755356606 0.0107283449 0.6815222382 
         124          126          132          137          145 
0.4187640970 0.0047754987 0.0002290047 0.0351748912 0.0118785036

pred contains the probability that the test observation is Iris virginica. Note that in glm(), when the response is a factor (as in this example), the first level of that factor (here versicolor) is taken as failure or 0, and the second and subsequent levels indicate success or 1. As there are only two classes in this example, the model is parameterised in terms of virginica; 1 - pred gives the predicted probability of versicolor.
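If you want to check which level is being treated as "success", look at the levels of the response factor (a quick sketch using mydata from above); the first level is coded 0 and the remaining levels 1:

levels(mydata$Species)
## [1] "versicolor" "virginica"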

I don't follow the error computation you included in the question and will therefore leave that up to you to work out. A cross-classification table of the success of the model can, however, be generated via:

> predSpecies <- factor(ifelse(pred >= 0.5, "virginica", "versicolor"))
> table(test$Species, predSpecies)
            predSpecies
             versicolor virginica
  versicolor          9         1
  virginica           1         9

indicating our model got two test set observations wrong.
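If you want a single out-of-sample error rate like the one in the question, a minimal sketch using the objects above is:

## proportion of misclassified test observations (2 out of 20 here)
mean(predSpecies != test$Species)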

Gavin Simpson
2

You need to split your data into train and test as the very first step: otherwise the PC scores are far from being independent.

That is, the PCA rotation is calculated from x[train, ] only!

The same rotation is then applied to x[test, ].
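A minimal sketch of that workflow, reusing the x and train objects from the question (assuming they are defined as shown there):

pc <- prcomp(x[train, ])                            ## rotation and centring come from the training rows only
scores.train <- pc$x                                ## scores for the training rows
scores.test  <- predict(pc, newdata = x[-train, ])  ## same rotation applied to the held-out rows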

For everything else, as @JorisMeys says, reproducible code is needed.

cbeleites unhappy with SX
  • You shouldn't, in theory, need to split them when it's unsupervised. – John Smith Jun 04 '12 at 12:43
  • @JohnSmith How would you interpret the out-of-sample error if you used information from the test set to form the PCs in the first place? It wouldn't be a real reflection of the true out-of-sample performance. – Gavin Simpson Jun 04 '12 at 12:48
  • @JohnSmith: Only calculations that are done solely on a per-sample basis mean that you don't need to split. Calculations that use more cases need the splitting; here that is the centring, possibly the scaling, and the rotation. – cbeleites unhappy with SX Jun 04 '12 at 13:19