0

I'm a new student in a bioinformatics lab, please feel free to correct me if anything is wrong.

I have made a CCA using the vegan package in R with the following script:

cca.analysis <- cca(mod ~ genus1 + genus2 + genus3, data)

I'm currently attempting to measure the scores/ contribution of each variable (genus) so I can determine which one was most influential to community variation in my dataset. I have two issues:

  1. How do you rescale the contribution of each genus irrespective of it's relative frequency to the other genera? For example, genus 1 is highly abundant compared to genus 3, which would mean that it will contribute more variation to the analysis.
  2. What script or function in the package would you use to measure the distance from the centroid to find the genus' contribution to variation?

Edit: I have made a reproducible example, to help give some insight about the question. Here is the genus data:

║ genus_1 ║ genus_2 ║ genus_3 ║ ║ 15.635 ║ 10.293 ║ 0 ║ ║ 9.7813 ║ 9.0061 ║ 5.4298 ║ ║ 15.896 ║ 2.5612 ║ 3.4335 ║ ║ 4.0054 ║ 0 ║ 2.0043 ║ ║ 15.929 ║ 16.213 ║ 0 ║ ║ 11.072 ║ 15.434 ║ 0 ║ ║ 12.539 ║ 7.2498 ║ 0 ║ ║ 9.1164 ║ 11.526 ║ 2.1649 ║ ║ 4.5011 ║ 0 ║ 0 ║ ║ 11.66 ║ 13.46 ║ 5.1416 ║

The mod part in the formula I provided corresponds to the following data, which I extracted from a PCoA analysis:

║ Coord_1 ║ Coord_2 ║ Coord_3 ║ Coord_4 ║ Coord_5 ║ Coord_6 ║ Coord_7 ║ ║ 0.954 ║ 0.928 ║ 0.952 ║ 1.009 ║ 1.016 ║ 0.943 ║ 1.031 ║ ║ 0.942 ║ 1.088 ║ 1.100 ║ 1.015 ║ 1.080 ║ 1.140 ║ 1.002 ║ ║ 0.932 ║ 0.989 ║ 1.005 ║ 0.974 ║ 0.990 ║ 1.047 ║ 1.035 ║ ║ 0.929 ║ 1.111 ║ 1.094 ║ 0.847 ║ 0.932 ║ 0.940 ║ 1.016 ║ ║ 0.947 ║ 1.008 ║ 0.937 ║ 1.055 ║ 1.056 ║ 0.964 ║ 1.022 ║ ║ 0.948 ║ 1.054 ║ 0.987 ║ 1.018 ║ 1.017 ║ 0.965 ║ 0.994 ║ ║ 0.946 ║ 1.023 ║ 0.911 ║ 1.014 ║ 1.062 ║ 1.076 ║ 1.063 ║ ║ 1.041 ║ 1.000 ║ 0.945 ║ 0.872 ║ 1.036 ║ 0.907 ║ 1.029 ║ ║ 0.926 ║ 1.107 ║ 1.027 ║ 0.943 ║ 0.993 ║ 1.006 ║ 0.947 ║ ║ 1.038 ║ 1.016 ║ 1.008 ║ 1.013 ║ 0.997 ║ 0.891 ║ 0.988 ║

You can plot this in R with function plot and this is hopefully get something like this: CCA plot

adam.bj
  • 11
  • 3
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Just asking for package recommendations is off topic. Just describe the problem you need to solve. If there is a package that does that already, someone will include that in their answer. Or it's possible it easy to do without a package. You should just ask one clear question at a time when posting here. – MrFlick Jan 09 '19 at 22:16
  • Thanks for the tip, @MrFlick. I have now added more information to the question. I believe the question does not requre a redirection to a different package as both parts of it can be answered using the `vegan` package. – adam.bj Jan 09 '19 at 23:34

1 Answers1

0

Actually, the scaling of the constraining variables (genus1 etc) does not influence their contributions to the model. You can verify this by multiplying one of your constraints with some number (say 10) and comparing the resulting models and seeing that they do not change. What will change are the regression coefficients for constraints, but they are of no interest here (regression coefficient will change to cancel the effect of multiplication).

The key point is: what do you mean with "contribution"? If you mean how much each of these constraints "explains" of the total variation in the data, you can get this information from anova(cca.analysis, by = "terms") or alternatively from anova(cca.analysis, by = "margin"). The first analysis will be sequential decomposition of explained variation where the components add up to 100% of explained, and the latter decomposition to unique terms where the components do not add up to 100%. Up to three components (genus), you can also use varpart function (for cca with argument chisquare = TRUE: for this you need the latest vegan release) which decomposes the total explained variation into unique and joint contributions.

If you mean something else with "contribution", please explain.

Jari Oksanen
  • 3,287
  • 1
  • 11
  • 15
  • Thank you, @Jari Oksanen for the answer. Indeed, what I meant by "contribution" is the percent variance explained by each genus on the PCoAs that were previously calculated. However, could you help me determine the difference in using this method compared to one I found in a research article? I'm not sure if I'm allowed to post a link here, so I will post an excerpt from the article's supplementary information: "Genus contribution (score) was calculated as the distance from the genus CCA coordinate to the CCA centroid. Larger values indicate higher contributions to community variation." – adam.bj Jan 10 '19 at 19:30
  • This literature approach looks like something completely different. I assume that they used raw data instead of PCoA axes in CCA. However, they did not do it correctly: CCA is a weighted ordination method, and you should look at weighted squared distances from the origin if you want to have contributions of column variables (species) to the overall analysis. In **vegan** these can be calculated with `goodness()`. I think you should *not* use PCoA coordinated in CCA, but you could use RDA. However, I am not sure that this kind of contribution can be meaningfully calculated. – Jari Oksanen Jan 11 '19 at 14:23
  • I suspected that something wasn't right, but there wasn't enough information on the methods to determine that. In the paper (not supplementary material), it is stated for the results "Top 10 contributors to community variation as determined by canonical correspondence analysis on unscaled genera abundances, plotted on the two first PCoA dimensions (arrows scaled to contribution)." - I was hoping that would help me determine the most important genera in my community analyses, but it doesn't look like a reliable approach. I will try to do an RDA analysis and see what I get. – adam.bj Jan 11 '19 at 21:48