
I have followed this tutorial to create and visualise a PCA. The part I'm particularly interested in is adding new data points to the existing model.

As the tutorial suggests, one would use predict(ir.pca, newdata = tail(log.ir, 2)) to predict PC scores for new observations. But how do I add these new observations to the existing plot? It doesn't look like the predict function returns the same kind of object as the ir.pca object passed to ggbiplot.
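
For example (a sketch using the objects defined in the code further down), what I get back from predict is just a matrix of scores:

new.scores <- predict(ir.pca, newdata = tail(log.ir, 2))
head(new.scores)  # a numeric matrix with columns PC1..PC4, not a prcomp object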

I have found similar questions here and here, but those calculate new PCA scores and add them to the variance plot (if I understood them correctly).

Ultimately, what I'm after is to see whether the new points fall within the confidence ellipse defined/derived from the initial dataset.

The code I'm using from the tutorial:

# log transform
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]

# apply PCA - scale. = TRUE is highly
# advisable, but default is FALSE.
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE)

library(devtools)
install_github("vqv/ggbiplot")  # current devtools expects the "user/repo" form
 
library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

And, as the tutorial suggests, I'd like to add newly arrived data to the existing plot visualised with ggplot.

Thanks

nlv
  • It would be easier to help if you provided a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in the question itself. – MrFlick May 20 '17 at 18:51
  • @MrFlick updated – nlv May 20 '17 at 19:19

1 Answer


When we inspect the ggplot object, we see it has an element named data:

str(g)
# List of 9
#  $ data       :'data.frame':  150 obs. of  3 variables:
#   ..$ xvar  : num [1:150] -2.41 -2.22 -2.58 -2.45 -2.54 ...
#   ..$ yvar  : num [1:150] -0.397 0.69 0.428 0.686 -0.508 ...
#   ..$ groups: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#  $ layers     :List of 5
#  <snip>

Hence we can simply append the new data points to this data element. Suppose these 10 observations from iris are our "new" observations, and we predict their PC values:

set.seed(123)
x <- sample(seq_len(nrow(iris)), 10)
predicted <- predict(ir.pca, newdata = log.ir[x, ])

We can append these predicted values to g$data:

g$data <- rbind(g$data, 
  data.frame(
    xvar = predicted[, "PC1"],
    yvar = predicted[, "PC2"],
    groups = "new"
  )
)

so that print(g) renders the biplot with the ten new points overlaid on the original groups and ellipses.
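
If you also want to check whether the new points fall inside the group ellipses (the original goal), here is a rough sketch, assuming ggbiplot's default 68% normal-probability ellipses: a Mahalanobis-distance test on the plotted coordinates against the corresponding chi-squared quantile describes the same region.

# Rough sketch: test each appended point against each group's 68% ellipse,
# using the plotted coordinates stored in g$data.
old <- subset(g$data, groups != "new")   # the original 150 observations
new <- subset(g$data, groups == "new")   # the appended points
cutoff <- qchisq(0.68, df = 2)           # ggbiplot's default ellipse.prob is 0.68
inside <- sapply(levels(ir.species), function(grp) {
  pts <- old[old$groups == grp, c("xvar", "yvar")]
  d2  <- mahalanobis(new[, c("xvar", "yvar")], colMeans(pts), cov(pts))
  d2 <= cutoff
})
inside  # TRUE where a new point falls inside that group's ellipse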

Weihuang Wong
  • Thanks for the answer. An additional question just to clarify something for me: is there a difference between using the predict function to visualise new data points (like you did above) and going through the whole PCA computation from the start (and then superimposing that on top of the existing PCA)? – nlv May 22 '17 at 08:58
  • I'm not an expert on PCA, so you might want to look elsewhere for a definitive answer, but my intuition is that yes, there is a difference, in that your principal components will almost certainly be different if you run PCA on a new dataset (one that combines the "existing" and "new" observations). – Weihuang Wong May 22 '17 at 14:54
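
A quick sketch of the point above, building on the objects already defined: refitting prcomp on the combined data generally yields different loadings, whereas predict() keeps the original rotation fixed.

# Compare loadings from the original fit with a refit on the augmented data
# (original rows plus the ten "new" rows used above).
# abs() sidesteps arbitrary sign flips in the principal components.
log.all <- log(rbind(iris[, 1:4], iris[x, 1:4]))
ir.pca.refit <- prcomp(log.all, center = TRUE, scale. = TRUE)
round(abs(ir.pca$rotation) - abs(ir.pca.refit$rotation), 3)  # non-zero entries show the PCs moved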