This is a reproducible example:
set.seed(1)
want <- sample(50, 40)
Iris <- iris[c(51:100, 101:150), ] ## only keep versicolor and virginica
## take our training and test sets
train <- droplevels(Iris[c((1:50)[want], (51:100)[want]), , drop = FALSE])
test <- droplevels(Iris[c((1:50)[-want], (51:100)[-want]), , drop = FALSE])
## fit the PCA
pc <- prcomp(train[, 1:4])
Now note that pc$x is the rotated data. You used X %*% pc$rotation (where X is the training data matrix) but didn't centre the data first; prcomp() centres the data before rotating, so the two results differ only by a constant shift in each column, and that shift is absorbed by the intercept of the regression, making the fitted models equivalent. Centring the predictors in the regression may still be useful, though.
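If you want to convince yourself of this, here is a quick check reusing the objects created above (Xc and shift are just illustrative names):
## pc$x is the *centred* training data multiplied by the rotation matrix
Xc <- scale(train[, 1:4], center = pc$center, scale = FALSE)
all.equal(Xc %*% pc$rotation, pc$x, check.attributes = FALSE)  ## should be TRUE
## the uncentred product differs from pc$x only by a constant per column,
## which the intercept of the logistic regression soaks up
shift <- as.matrix(train[, 1:4]) %*% pc$rotation - pc$x
apply(shift, 2, range)  ## min and max agree (up to rounding) in each column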
## create data frame for logistic regression
mydata <- data.frame(Species = train[, "Species"], pc$x)
## ...and fit the model
mod <- glm(Species ~ PC1, data = mydata, family = binomial)
Predict the scores on PC1 for the test set data; that is, rotate the test set using the same rotation used to form the PCs of the training data. For that we can use the predict() method for class "prcomp":
test.p <- predict(pc, newdata = test[, 1:4])
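Under the hood this just centres the new data with the training means and applies the training rotation; a hand-rolled version (test.manual is only an illustrative name) should match:
## centre the test data with the *training* centre, then rotate
test.manual <- scale(test[, 1:4], center = pc$center, scale = FALSE) %*% pc$rotation
all.equal(test.manual, test.p, check.attributes = FALSE)  ## should be TRUE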
Now use that to predict the class:
pred <- predict(mod, newdata = data.frame(test.p), type = "response")
pred
> pred
         56          66          67          71          72
0.080427399 0.393133104 0.092661480 0.395813527 0.048277608
         74          76          82          87          95
0.226191156 0.333553423 0.003860679 0.617977807 0.029469167
        106         116         117         121         122
0.999648054 0.922145431 0.924464339 0.989271655 0.318477762
        124         126         132         137         145
0.581235903 0.995224501 0.999770995 0.964825109 0.988121496
> 1 - pred
          56           66           67           71           72
0.9195726006 0.6068668957 0.9073385196 0.6041864731 0.9517223918
          74           76           82           87           95
0.7738088439 0.6664465767 0.9961393215 0.3820221934 0.9705308332
         106          116          117          121          122
0.0003519463 0.0778545688 0.0755356606 0.0107283449 0.6815222382
         124          126          132          137          145
0.4187640970 0.0047754987 0.0002290047 0.0351748912 0.0118785036
pred contains the probability that the test observation is Iris virginica. Note that in glm(), when the response is a factor (as in this example), the first level of that factor (here versicolor) is taken as failure or 0 and the second and subsequent levels indicate success or 1. As there are only two classes in this example, success corresponds to virginica, so pred is the predicted probability of virginica and 1 - pred gives the predicted probability of versicolor.
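If in doubt about which level is treated as the success, check the level ordering of the response factor (using the mydata frame created above):
## first level = failure (0), remaining level(s) = success (1)
levels(mydata$Species)  ## "versicolor" "virginica", so virginica is the success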
I don't follow the error computation you included in the question and therefore will leave that up to you to work out. A cross-classification table of the model's success can, however, be generated via:
> predSpecies <- factor(ifelse(pred >= 0.5, "virginica", "versicolor"))
> table(test$Species, predSpecies)
            predSpecies
             versicolor virginica
  versicolor          9         1
  virginica           1         9
indicating that our model got two test set observations wrong, one from each class.
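If you want a single summary number instead of the table, the accuracy and misclassification rate follow directly from it (tab is just a convenience name):
tab <- table(test$Species, predSpecies)
sum(diag(tab)) / sum(tab)      ## accuracy: 18/20 = 0.9
1 - sum(diag(tab)) / sum(tab)  ## error rate: 2/20 = 0.1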