
I have followed this tutorial to create and visualise a PCA. The part I'm particularly interested in is adding new data points to the existing model.

As the tutorial suggests, one would use predict(ir.pca, newdata = tail(log.ir, 2)) to predict PC scores for new observations. But how do I add these new observations to the existing plot? It doesn't look like the predict function returns the same kind of object as the ir.pca object passed to ggbiplot.
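
For example (a sketch using the objects defined in the code further down), what I get back from predict is just a matrix of scores:

new.scores <- predict(ir.pca, newdata = tail(log.ir, 2))
head(new.scores)  # a numeric matrix with columns PC1..PC4, not a prcomp object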

I have found similar questions here and here, but those calculate new PCA scores and add them to the variance plot (if I understood them correctly).

Ultimately, what I'm after is to see whether the new points fall within the confidence ellipse defined/derived from the initial dataset.

The code I'm using from the tutorial:

# log transform
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly
# advisable, but default is FALSE.
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE)

library(devtools)
install_github("vqv/ggbiplot")  # current devtools expects the "user/repo" form
 
library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

And, as the tutorial suggests, I'd like to add newly arrived data to the existing plot visualised with ggplot.

Thanks

nlv
  • It would be easier to help if you provided a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in the question itself. – MrFlick May 20 '17 at 18:51
  • @MrFlick updated – nlv May 20 '17 at 19:19

1 Answer


When we inspect the ggplot object, we see it has an element named data:

str(g)
# List of 9
#  $ data       :'data.frame':  150 obs. of  3 variables:
#   ..$ xvar  : num [1:150] -2.41 -2.22 -2.58 -2.45 -2.54 ...
#   ..$ yvar  : num [1:150] -0.397 0.69 0.428 0.686 -0.508 ...
#   ..$ groups: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ layers     :List of 5
#  <snip>

Hence we can simply append the new data points to this data element. Suppose these 10 observations from iris are our "new" observations, and we predict their PC values:

set.seed(123)
x <- sample(seq_len(nrow(iris)), 10)
predicted <- predict(ir.pca, newdata = log.ir[x, ])

We can append these predicted values to g$data:

g$data <- rbind(g$data, 
  data.frame(
    xvar = predicted[, "PC1"],
    yvar = predicted[, "PC2"],
    groups = "new"
  )
)

so that print(g) renders the biplot with the ten new points overlaid on the original groups and ellipses.
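
If you also want to check whether the new points fall inside the group ellipses (the original goal), here is a rough sketch, assuming ggbiplot's default 68% normal-probability ellipses: a Mahalanobis-distance test on the plotted coordinates against the corresponding chi-squared quantile describes the same region.

# Rough sketch: test each appended point against each group's 68% ellipse,
# using the plotted coordinates stored in g$data.
old <- subset(g$data, groups != "new")   # the original 150 observations
new <- subset(g$data, groups == "new")   # the appended points
cutoff <- qchisq(0.68, df = 2)           # ggbiplot's default ellipse.prob is 0.68
inside <- sapply(levels(ir.species), function(grp) {
  pts <- old[old$groups == grp, c("xvar", "yvar")]
  d2  <- mahalanobis(new[, c("xvar", "yvar")], colMeans(pts), cov(pts))
  d2 <= cutoff
})
inside  # TRUE where a new point falls inside that group's ellipse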

Weihuang Wong
  • Thanks for the answer. An additional question just to clarify something for me: is there a difference between using the predict function to visualise new data points (like you did above) and going through the whole PCA computation from the start (and then superimposing that on top of the existing PCA)? – nlv May 22 '17 at 08:58
  • I'm not an expert on PCA, so you might want to look elsewhere for a definitive answer, but my intuition is that yes, there is a difference, in that your principal components will almost certainly be different if you run PCA on a new dataset (one that combines the "existing" and "new" observations). – Weihuang Wong May 22 '17 at 14:54
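
A quick sketch of the point above, building on the objects already defined: refitting prcomp on the combined data generally yields different loadings, whereas predict() keeps the original rotation fixed.

# Compare loadings from the original fit with a refit on the augmented data
# (original rows plus the ten "new" rows used above).
# abs() sidesteps arbitrary sign flips in the principal components.
log.all <- log(rbind(iris[, 1:4], iris[x, 1:4]))
ir.pca.refit <- prcomp(log.all, center = TRUE, scale. = TRUE)
round(abs(ir.pca$rotation) - abs(ir.pca.refit$rotation), 3)  # non-zero entries show the PCs moved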