3

PCA on environmental parameters[Fig 4 below is what I would need as outcome, the other 2 figures show what I get from my data: PCA on environmental data or on the abundance data][1] No duplication ofR - how to make PCA biplot more readable or Plotting pca biplot with ggplot2

It is conducting abundance and environmental data in PCA simultaneously, working with two different data frames: I was told it shall work in the way, that from PCA1 you get the coordinates for the species and than with a second command you overlay over the species coordinates vectors not from PCA 1 but from another dataset that has the same sites as the first one not abundance but environmental data.

My prof did a PCA on abundance and environmental data 10 years ago. He overlaid the results of the principal components analysis (PCA) of abundance data of species with the correlations between the PCA scores and the environmental factors potentially influencing the ciliate distributions. How can I do that in R, when I have one data frame where the abundance of each species at the 33 sites are listed and one data frame where 12 different environmental parameters of the 33 sites are listed? So e.g. with the following data

#Create random dataframe of abundance data, I am sure this can be done simpler and more elegant than this ;)
species<-c("spec1", "spec2", "spec3", "spec 4", "spec 5", "spec 6", "spec7")
site1<-c(2,4,19,34,3,6,9)
site2<-c(5,8,9,12,0,1,1)
site3<-c(23,56,7,1,1,1,2)
site4<-c(4,6,2,8,5,1,7)
abundance<-data.frame(species,site1,site2,site3,site4)
rownames(abundance)<-abundance$species
abundance<-abundance[,-1]
#Create random dataframe of abundance data
#environmental parameters of the sites
X<-c("site1","site2","site3","site4")
Temp<-c(24,24.5,23.5,25)
Chla<-c(2.2,1.5,2.0,3.4)
Salinity<-c(24,25,26,23)
Depth<-c(200,400,600,200)
environment<-data.frame(X,Temp,Chla,Salinity,Depth)
rownames(environment)<-environment$X
environment<-environment[,-1]
###PCA on abundance data
#hellinger pre-transformation of abundance data
??decostand
library(vegan)
abu.h<-decostand(abundance,"hellinger")
abu.h.pca<-prcomp(abu.h)
biplot(abu.h.pca)
##and now I would need to discard the sites vectors and overlay it with 
#the environmental sites factors, due to my prof?
?prcomp
envir.PCA<-prcomp(environment,scale = TRUE)
biplot(envir.PCA)
?biplot

enter image description here

Ta Ani
  • 73
  • 7
  • 2
    Thank you for using a reproducible example in your first question! Good work! One tiny suggestion - in the future it may help to hyperlink or explain words that are specific to your field of study, like ciliate or (this context of) abundance. Finally, this doesn't sound like it is an IDE related problem, so the `rstudio` tag is probably inappropriate. – Hack-R Jul 23 '18 at 14:48
  • Also, is this different from the general case of PCA bi-plots and overlaying data? If not, there are several existing answers already on StackOverflow – Hack-R Jul 23 '18 at 14:54
  • Possible duplicate of [R - how to make PCA biplot more readable](https://stackoverflow.com/questions/17055291/r-how-to-make-pca-biplot-more-readable) – Hack-R Jul 23 '18 at 14:56
  • Possible duplicate of [Plotting pca biplot with ggplot2](https://stackoverflow.com/questions/6578355/plotting-pca-biplot-with-ggplot2) – AkselA Jul 23 '18 at 15:03
  • Thanks a lot for your help and the editing! It is a pity, but its no duplicate of the themes mentioned above. They are all working with one dataset! I have two different datasets. – Ta Ani Jul 23 '18 at 19:15
  • @TaAni I'm trying to understand what you want. Do you mean the following, "extract the principal component variables for the `abundance` dataset and then map it to the `environment` dataset"? OR ""extract the principal component (pc) variables for the `abundance`, `environment` datasets and JOIN the pc's together"? – mnm Jul 24 '18 at 07:55
  • @TaAni are you asking me, "is it true that the variables are the vectors?" – mnm Jul 24 '18 at 08:41
  • @Ashish: Sorry I have a problem with phrasing. No I am not. I have edited my question and hope that it got more specific now. – Ta Ani Jul 24 '18 at 08:58
  • @TaAni no problem. After several re-readings, I have understood what you require. It is something like, "first extract the principal components for the `abundance` dataset. Then connect them with the environmental parameters in the `environment` dataframe. Is my understanding correct? – mnm Jul 24 '18 at 09:01
  • @Ashish: Yes, thank you for trying and rereading! – Ta Ani Jul 24 '18 at 09:05
  • @TaAni no problem at all. As they say, "it takes two to Tango" :) Continuing further, I suggest read the answer by user j-con on this [post](https://stackoverflow.com/questions/51379618/how-to-programmatically-determine-the-column-indices-of-principal-components-usi). The idea is to determine the principal components in a given dataset. – mnm Jul 24 '18 at 09:13
  • @TaAni so you can do the following, `factoextra::fviz_screeplot(abu.h.pca, addlabels = TRUE, barfill = "gray", barcolor = "black", ylim = c(0, 50), xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for abundance dataset")`. And then `f<-factoextra::fviz_contrib(abu.h.pca, choice = "var", axes = c(1), top = 10, sort.val = c("desc")) ` – mnm Jul 24 '18 at 09:15
  • @TaAni, and then `#save data from contribution plot dat<-f$data dat #filter out ID's that are higher than, say, 10 r<-rownames(dat[dat$contrib>10,]) new<- abundance[r] new` You will see the `new` dataframe has determined `site1` and `site2` are the main columns. – mnm Jul 24 '18 at 09:17
  • @TaAni You can then merge the two dataframe's together like, `> newdf<-merge(new, environment, all = TRUE)` and `> newdf`. Let me know if this helps, if yes then I will post it as an answer. – mnm Jul 24 '18 at 09:20
  • @Ashish I am not sure. I think it looses the individuals and they are the most important thing in the plot because I want to explain how the individuals group (by depth and or food availability...). – Ta Ani Jul 24 '18 at 09:53
  • Can I extract the coordinates for the individuals (get_pca_ind and ind$coord) and for the variables and do a biplot(x,y)? – Ta Ani Jul 24 '18 at 10:04
  • @TaAni see, if you do this `factoextra::fviz_screeplot(abu.h.pca, addlabels = TRUE, barfill = "gray", barcolor = "black", xlab = "Principal Component", ylab = "Percentage of explained variance", main = "Principal Component (PC) for abundance dataset")` In this plot you'll see the first PC1 contributes `68.3%` and the second PC2 contributes `29.3%` variance in the data. – mnm Jul 24 '18 at 10:06
  • @TaAni, thereafter to see the contribution of these 2 PCs, you can do `factoextra::fviz_contrib(abu.h.pca, choice = "var", axes = c(1,2), top = 10, sort.val = c("desc"))` and you'll see the variables `site3` and `site1` are the prominent contributors. – mnm Jul 24 '18 at 10:09
  • @TaAni can you post the code for calculating `ind`? – mnm Jul 24 '18 at 10:13
  • @Ashish sorry I am to slow to delete my comments in the meantime. #Graph of individuals fviz_pca_ind(abu.h.pca) ##get coordinates ind<-get_pca_ind(abu.h.pca) head(ind$coord) #x in biplot ind<-ind$coord ind<-ind[,1:2] ind #y variables # Extract the results for variables only vari<-get_pca_var(envir.PCA) var<-vari$coord var<-var[,1:2] var biplot(ind, var, var.axes = TRUE) With that I get the graph I would like to have, I just think that it is not "right" because I didn´t link anything just overlay 2 two column matrices... – Ta Ani Jul 24 '18 at 10:17
  • @TaAni ok, try changing the following code `vari<-get_pca_var(envir.PCA)` to `vari<-get_pca_var(abu.h.pca)`. Everything else is ok. – mnm Jul 24 '18 at 10:25
  • @Ashish: but than I am again at the beginning, because i get the wrong variables for my individuals. I would need a plot which looks exactly like the one which i create with vary<-get_pca_var(enviro.PCA) ;( – Ta Ani Jul 24 '18 at 11:34
  • @Ashish: Besides the consisting problem, I am thankful for the application you showed me how to look at the data with factoextra and figure out which are the prominent contributors. I will apply that with a different dataset! – Ta Ani Jul 24 '18 at 12:37
  • @TaAni, No problem. Try `vari<- get_pca_var(envir.PCA)`. In my previous comment I thought that `envir.PCA` was a typo, i.e. why I suggested to use `abu.h.pca`.I tried the following code `envir.PCA<- prcomp(environment, scale. = TRUE) vari<- get_pca_var(envir.PCA) var<-vari$coord var<-var[,1:2] var biplot(ind, var, var.axes = TRUE)`. I think this should do. – mnm Jul 24 '18 at 12:44
  • @TaAni thereafter, you can do `f<-factoextra::fviz_contrib(envir.PCA, choice = "var", axes = c(1,2), top = 10, sort.val = c("desc"))` to plot the contributions. Then, save the data from the contributions plot, `dat<-f$data`. Then extract the row names of these contributions `r<-rownames(dat[dat$contrib>10,])`. And finally, map the contributions to the `environment` variable like `new.df<- environment[r]`. Look at `new.df` you will see output like `head(new, 2)` ` Temp Chla Plo Plo2 site1 24.0 2.2 1000 200 site2 24.5 1.5 2000 400` – mnm Jul 24 '18 at 13:08
  • @Ashish: but i would still need a dataframe where the rows are not the sites but the species and the columns as in new.df which you created the environmental parameters. The plot "biplot(ind, var, var.axes = TRUE)" would be perfect. But does it make sense to overlay the coordinates of the variables of a second PCA over the coordinates of the individual from another PCA. The datasets are somehow linked via the sites parameter, but I don´t know if that is enough and the data stays interpretable? – Ta Ani Jul 25 '18 at 12:10

1 Answers1

0

I found a solution: envfit {vegan}
Fits an Environmental Vector or Factor onto an Ordination

Ta Ani
  • 73
  • 7