How to conduct PCA on a table with abundance data BUT overlay it with vectors of environmental parameters of the sites in R?

Question

[Fig 4 below is what I would need as outcome, the other 2 figures show what I get from my data: PCA on environmental data or on the abundance data No duplication ofR - how to make PCA biplot more readable or Plotting pca biplot with ggplot2

It is conducting abundance and environmental data in PCA simultaneously, working with two different data frames: I was told it shall work in the way, that from PCA1 you get the coordinates for the species and than with a second command you overlay over the species coordinates vectors not from PCA 1 but from another dataset that has the same sites as the first one not abundance but environmental data.

My prof did a PCA on abundance and environmental data 10 years ago. He overlaid the results of the principal components analysis (PCA) of abundance data of species with the correlations between the PCA scores and the environmental factors potentially influencing the ciliate distributions. How can I do that in R, when I have one data frame where the abundance of each species at the 33 sites are listed and one data frame where 12 different environmental parameters of the 33 sites are listed? So e.g. with the following data

#Create random dataframe of abundance data, I am sure this can be done simpler and more elegant than this ;)
species<-c("spec1", "spec2", "spec3", "spec 4", "spec 5", "spec 6", "spec7")
site1<-c(2,4,19,34,3,6,9)
site2<-c(5,8,9,12,0,1,1)
site3<-c(23,56,7,1,1,1,2)
site4<-c(4,6,2,8,5,1,7)
abundance<-data.frame(species,site1,site2,site3,site4)
rownames(abundance)<-abundance$species
abundance<-abundance[,-1]
#Create random dataframe of abundance data
#environmental parameters of the sites
X<-c("site1","site2","site3","site4")
Temp<-c(24,24.5,23.5,25)
Chla<-c(2.2,1.5,2.0,3.4)
Salinity<-c(24,25,26,23)
Depth<-c(200,400,600,200)
environment<-data.frame(X,Temp,Chla,Salinity,Depth)
rownames(environment)<-environment$X
environment<-environment[,-1]
###PCA on abundance data
#hellinger pre-transformation of abundance data
??decostand
library(vegan)
abu.h<-decostand(abundance,"hellinger")
abu.h.pca<-prcomp(abu.h)
biplot(abu.h.pca)
##and now I would need to discard the sites vectors and overlay it with 
#the environmental sites factors, due to my prof?
?prcomp
envir.PCA<-prcomp(environment,scale = TRUE)
biplot(envir.PCA)
?biplot

Thank you for using a reproducible example in your first question! Good work! One tiny suggestion - in the future it may help to hyperlink or explain words that are specific to your field of study, like ciliate or (this context of) abundance. Finally, this doesn't sound like it is an IDE related problem, so the `rstudio` tag is probably inappropriate. — Hack-R, Jul 23 '18 at 14:48
Also, is this different from the general case of PCA bi-plots and overlaying data? If not, there are several existing answers already on StackOverflow — Hack-R, Jul 23 '18 at 14:54
Possible duplicate of [R - how to make PCA biplot more readable](https://stackoverflow.com/questions/17055291/r-how-to-make-pca-biplot-more-readable) — Hack-R, Jul 23 '18 at 14:56
Possible duplicate of [Plotting pca biplot with ggplot2](https://stackoverflow.com/questions/6578355/plotting-pca-biplot-with-ggplot2) — AkselA, Jul 23 '18 at 15:03
Thanks a lot for your help and the editing! It is a pity, but its no duplicate of the themes mentioned above. They are all working with one dataset! I have two different datasets. — Ta Ani, Jul 23 '18 at 19:15
@TaAni I'm trying to understand what you want. Do you mean the following, "extract the principal component variables for the `abundance` dataset and then map it to the `environment` dataset"? OR ""extract the principal component (pc) variables for the `abundance`, `environment` datasets and JOIN the pc's together"? — mnm, Jul 24 '18 at 07:55
@TaAni are you asking me, "is it true that the variables are the vectors?" — mnm, Jul 24 '18 at 08:41
@Ashish: Sorry I have a problem with phrasing. No I am not. I have edited my question and hope that it got more specific now. — Ta Ani, Jul 24 '18 at 08:58
@TaAni no problem. After several re-readings, I have understood what you require. It is something like, "first extract the principal components for the `abundance` dataset. Then connect them with the environmental parameters in the `environment` dataframe. Is my understanding correct? — mnm, Jul 24 '18 at 09:01
@TaAni no problem at all. As they say, "it takes two to Tango" :) Continuing further, I suggest read the answer by user j-con on this [post](https://stackoverflow.com/questions/51379618/how-to-programmatically-determine-the-column-indices-of-principal-components-usi). The idea is to determine the principal components in a given dataset. — mnm, Jul 24 '18 at 09:13
@TaAni so you can do the following, `factoextra::fviz_screeplot(abu.h.pca, addlabels = TRUE, barfill = "gray", barcolor = "black", ylim = c(0, 50), xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for abundance dataset")`. And then `f<-factoextra::fviz_contrib(abu.h.pca, choice = "var", axes = c(1), top = 10, sort.val = c("desc")) ` — mnm, Jul 24 '18 at 09:15
@TaAni, and then `#save data from contribution plot dat<-f$data dat #filter out ID's that are higher than, say, 10 r<-rownames(dat[dat$contrib>10,]) new<- abundance[r] new` You will see the `new` dataframe has determined `site1` and `site2` are the main columns. — mnm, Jul 24 '18 at 09:17
@TaAni You can then merge the two dataframe's together like, `> newdf<-merge(new, environment, all = TRUE)` and `> newdf`. Let me know if this helps, if yes then I will post it as an answer. — mnm, Jul 24 '18 at 09:20
@Ashish I am not sure. I think it looses the individuals and they are the most important thing in the plot because I want to explain how the individuals group (by depth and or food availability...). — Ta Ani, Jul 24 '18 at 09:53
Can I extract the coordinates for the individuals (get_pca_ind and ind$coord) and for the variables and do a biplot(x,y)? — Ta Ani, Jul 24 '18 at 10:04
@TaAni see, if you do this `factoextra::fviz_screeplot(abu.h.pca, addlabels = TRUE, barfill = "gray", barcolor = "black", xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for abundance dataset")` In this plot you'll see the first PC1 contributes `68.3%` and the second PC2 contributes `29.3%` variance in the data. — mnm, Jul 24 '18 at 10:06
@TaAni, thereafter to see the contribution of these 2 PCs, you can do `factoextra::fviz_contrib(abu.h.pca, choice = "var", axes = c(1,2), top = 10, sort.val = c("desc"))` and you'll see the variables `site3` and `site1` are the prominent contributors. — mnm, Jul 24 '18 at 10:09
@Ashish sorry I am to slow to delete my comments in the meantime. #Graph of individuals fviz_pca_ind(abu.h.pca) ##get coordinates ind<-get_pca_ind(abu.h.pca) head(ind$coord) #x in biplot ind<-ind$coord ind<-ind[,1:2] ind #y variables # Extract the results for variables only vari<-get_pca_var(envir.PCA) var<-vari$coord var<-var[,1:2] var biplot(ind, var, var.axes = TRUE) With that I get the graph I would like to have, I just think that it is not "right" because I didn´t link anything just overlay 2 two column matrices... — Ta Ani, Jul 24 '18 at 10:17
@TaAni ok, try changing the following code `vari<-get_pca_var(envir.PCA)` to `vari<-get_pca_var(abu.h.pca)`. Everything else is ok. — mnm, Jul 24 '18 at 10:25
@Ashish: but than I am again at the beginning, because i get the wrong variables for my individuals. I would need a plot which looks exactly like the one which i create with vary<-get_pca_var(enviro.PCA) ;( — Ta Ani, Jul 24 '18 at 11:34
@Ashish: Besides the consisting problem, I am thankful for the application you showed me how to look at the data with factoextra and figure out which are the prominent contributors. I will apply that with a different dataset! — Ta Ani, Jul 24 '18 at 12:37
@TaAni, No problem. Try `vari<- get_pca_var(envir.PCA)`. In my previous comment I thought that `envir.PCA` was a typo, i.e. why I suggested to use `abu.h.pca`.I tried the following code `envir.PCA<- prcomp(environment, scale. = TRUE) vari<- get_pca_var(envir.PCA) var<-vari$coord var<-var[,1:2] var biplot(ind, var, var.axes = TRUE)`. I think this should do. — mnm, Jul 24 '18 at 12:44
@TaAni thereafter, you can do `f<-factoextra::fviz_contrib(envir.PCA, choice = "var", axes = c(1,2), top = 10, sort.val = c("desc"))` to plot the contributions. Then, save the data from the contributions plot, `dat<-f$data`. Then extract the row names of these contributions `r<-rownames(dat[dat$contrib>10,])`. And finally, map the contributions to the `environment` variable like `new.df<- environment[r]`. Look at `new.df` you will see output like `head(new, 2)` ` Temp Chla Plo Plo2 site1 24.0 2.2 1000 200 site2 24.5 1.5 2000 400` — mnm, Jul 24 '18 at 13:08
@Ashish: but i would still need a dataframe where the rows are not the sites but the species and the columns as in new.df which you created the environmental parameters. The plot "biplot(ind, var, var.axes = TRUE)" would be perfect. But does it make sense to overlay the coordinates of the variables of a second PCA over the coordinates of the individual from another PCA. The datasets are somehow linked via the sites parameter, but I don´t know if that is enough and the data stays interpretable? — Ta Ani, Jul 25 '18 at 12:10

score 0 · Answer 1 · answered Aug 20 '18 at 13:53

0

I found a solution: envfit {vegan}
Fits an Environmental Vector or Factor onto an Ordination

answered Aug 20 '18 at 13:53

Ta Ani

73
7

How to conduct PCA on a table with abundance data BUT overlay it with vectors of environmental parameters of the sites in R?

1 Answers1