6

I am trying to understand how to work with ANOVAs and post-hoc tests in R. So far, I have used aov() and TukeyHSD() to analyse my data. Example:

uni2.anova <- aov(Sum_Uni ~ Micro, data= uni2)

uni2.anova

Call:
aov(formula = Sum_Uni ~ Micro, data = uni2)

Terms:
                    Micro  Residuals
Sum of Squares  0.04917262 0.00602925
Deg. of Freedom         15         48

Residual standard error: 0.01120756 
Estimated effects may be unbalanced

My problem is, now I have a huge list of pairwise comparisons but cannot do anything with it:

 TukeyHSD(uni2.anova)
 Tukey multiple comparisons of means
   95% family-wise confidence level

Fit: aov(formula = Sum_Uni ~ Micro, data = uni2)

$Micro
                               diff          lwr           upr     p adj
Act_Glu2-Act_Ala2     -0.0180017863 -0.046632157  0.0106285840 0.6448524
Ana_Ala2-Act_Ala2     -0.0250134285 -0.053643799  0.0036169417 0.1493629
NegI_Ala2-Act_Ala2     0.0702274527  0.041597082  0.0988578230 0.0000000

This dataset has 40 rows... Idealy, I would like to get a dataset that looks something like this:

  • Act_Glu2 : a
  • Act_Ala2 : a
  • NegI_Ala2: b...

I hope you get the point. So far, I have found nothing comparable online... I also tried to select only significant pairs in the file resulting from TukeyHSD, but the file does not "acknowlegde" that it is made up of rows & columns, making selecting impossible...

Maybe there is something fundamentally wrong with my approach?

Carolin
  • 63
  • 1
  • 4
  • What does "Act_Glu2:a" mean? How is it different from "Act_Glu2-Act_Ala2" – John Nov 02 '11 at 15:45
  • @John Ohh we might be off. The OP mentions "classify" in the title, but nowhere in the post. If she really wants to classify (cluster?) then she might be writing this to show she wants a list of the amino acids and the cluster they were assigned to (i.e. Act_Glu2 and Act_Ala2 are both in cluster "a"). I don't know though I could be totally wrong. At any rate, Carolin, can you clarify some on these points? – John Colby Nov 02 '11 at 16:08
  • @ John Colby: Yes, I think you understand what I mean. Act_Glu2 and Act_Ala2 show no significant difference in the Tukey test, hence they would be classified (or clustered, if that is the correct term) into the same group. NegI_Ala is significantly different to at least one of them, so if I plot the data, I would show this significance by adding "a" to the first two and "b" to the third data point. But as there are so many datasets, I would rather not do this manually... – Carolin Nov 02 '11 at 16:33

3 Answers3

7

I think the OP wants the letters to get a view of the comparisons.

library(multcompView)
multcompLetters(extract_p(TukeyHSD(uni2.anova)))

That will get you the letters.

You can also use the multcomp package

library(multcomp)
cld(glht(uni2.anova, linct = mcp(Micro = "Tukey")))

I hope this is what you need.

Luciano Selzer
  • 9,806
  • 3
  • 42
  • 40
  • With Carolin's clarification, I think this is the right track. – John Colby Nov 02 '11 at 20:19
  • 1
    With a little correction :) `hsd <- TukeyHSD(uni2.anova)` `multcompLetters(extract_p(hsd$Micro))` Because TukeyHSD(uni2.anova) results in more than just the list of pairwise comparisions & in this case hsd$Micro is just the list. – Carolin Nov 03 '11 at 10:34
2

The results from the TukeyHSD are a list. Use str to look at the structure. In your case you'll see that it's a list of one item and that item is basically a matrix. So, to extract the first column you'll want to save the TukeyHSD result

hsd <- TukeyHSD(uni2.anova)

If you look at str(hsd) you can that you can then get at bits...

hsd$Micro[,1]

That will give you the column of your differences. You should be able to extract what you want now.

John
  • 23,360
  • 7
  • 57
  • 83
  • Oh great! I have tried something like: `TukeyHSD(uni2.anova)[4,]` which returned "Wrong number of dimensions"... Thank you! – Carolin Nov 02 '11 at 16:24
  • Hm, if I want to select rows with specific attributes like this: `hsd$Micro[hsd$Micro[,4] < 0.05]` I don't get all the columns of hsd$Micro, just the 4th. – Carolin Nov 02 '11 at 17:08
  • Fixed it! `hsd$Micro[hsd$Micro[,4] < 0.05,]` – Carolin Nov 02 '11 at 17:09
1

Hard to tell without example data, but assuming Micro is just a factor with 4 levels and uni2 looks something like

n = 40
Micro = c('Act_Glu2', 'Act_Ala2', 'Ana_Ala2', 'NegI_Ala2')[sample(4, 40, rep=T)]
Sum_Uni = rnorm(n, 5, 0.5)
Sum_Uni[Micro=='Act_Glu2'] = Sum_Uni[Micro=='Act_Glu2'] + 0.5

uni2 = data.frame(Sum_Uni, Micro)
> uni2
   Sum_Uni     Micro
1 4.964061  Ana_Ala2
2 4.807680  Ana_Ala2
3 4.643279 NegI_Ala2
4 4.793383  Act_Ala2
5 5.307951 NegI_Ala2
6 5.171687  Act_Glu2
...

then I think what you're actually trying to get at is the basic multiple regression output:

fit = lm(Sum_Uni ~ Micro, data = uni2)

summary(fit)
anova(fit)
> summary(fit)

Call:
lm(formula = Sum_Uni ~ Micro, data = uni2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26301 -0.35337 -0.04991  0.29544  1.07887 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.8364     0.1659  29.157  < 2e-16 ***
MicroAct_Glu2    0.9542     0.2623   3.638 0.000854 ***
MicroAna_Ala2    0.1844     0.2194   0.841 0.406143    
MicroNegI_Ala2   0.1937     0.2158   0.898 0.375239    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4976 on 36 degrees of freedom
Multiple R-squared: 0.2891, Adjusted R-squared: 0.2299 
F-statistic:  4.88 on 3 and 36 DF,  p-value: 0.005996 

> anova(fit)
Analysis of Variance Table

Response: Sum_Uni
          Df Sum Sq Mean Sq F value   Pr(>F)   
Micro      3 3.6254 1.20847  4.8801 0.005996 **
Residuals 36 8.9148 0.24763                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

You can access the numbers in any of these tables like, for example,

> summary(fit)$coef[2,4]
[1] 0.0008536287

To see the list of what is stored in each object, use names():

> names(summary(fit))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 

In addition to the TukeyHSD() function you found, there are many other options for looking at the pairwise tests further, and correcting the p-values if desired. These include pairwise.table(), estimable() in gmodels, the resampling and boot packages, and others...

John Colby
  • 22,169
  • 4
  • 57
  • 69