Wrong degrees of freedom in lsmeans and SE calculation in R

Question

I have this sample data:

Sample  Replication Days

    1   1   10
    1   1   14
    1   1   13
    1   1   14
    2   1   NA
    2   1   5
    2   1   18
    2   1   20
    1   2   16
    1   2   NA
    1   2   18
    1   2   21
    2   2   15
    2   2   7
    2   2   12
    2   2   14

I have four observations for each sample with a total of 64 samples in each of the two replications. In total, I have 512 values for both the replications. I also have some missing values designated as 'NA'. I prformed ANOVA for Mean values for each Sample for each Rep that I generated using

library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE)) 
curve.anova <- aov(Mean~Rep+Sample, data=df)

Result of anova is:

> summary(curve.anova) 
            Df Sum Sq Mean Sq F value Pr(>F)    
Rep          1    6.1   6.071   2.951 0.0915 .  
Sample        63 1760.5  27.945  13.585 <2e-16 ***
Residuals   54  111.1   2.057

I created a table for mean and SE values,

ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")

A few lines from file are:

Sample  lsmean  SE  df  lower.CL    upper.CL
1       24.875  1.014145417 54  22.84176086 26.90823914
2       25.5    1.014145417 54  23.46676086 27.53323914
3       31.32575758 1.440722628 54  28.43728262 34.21423253
4       26.375  1.014145417 54  24.34176086 28.40823914
5       26.42424242 1.440722628 54  23.53576747 29.31271738
6       25.5    1.014145417 54  23.46676086 27.53323914
7       28.375  1.014145417 54  26.34176086 30.40823914
8       24.875  1.014145417 54  22.84176086 26.90823914
9       21.16666667 1.014145417 54  19.13342752 23.19990581
10      23.875  1.014145417 54  21.84176086 25.90823914

df for all 64 samples is 54 and the error bars in the ggplot are mostly equal for all the Samples. SE values are larger than the manually calculated values. Based on anova results, df=54 is for residuals.

I want to double check the ANOVA results so that they are correct and I am correctly generating lsmeans and SE to plot a bargraph using ggplot with confirdence interval error bars.

I will appreciate any help. Thank you!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Dec 10 '19 at 16:12

dc37 · Answer 1 · 2019-12-10T22:07:23.637

After reading your comments, I think your workflow as an issue. Basically, when you are applying your anova test, you are doing it on means of the different samples. So, in your example, when you are doing :

curve.anova <- aov(Mean~Rep+Sample, data=df)

You are comparing these values:

> df
# A tibble: 4 x 3
# Groups:   Sample [2]
  Sample Replication  Mean
   <dbl>       <dbl> <dbl>
1      1           1  12.8
2      1           2  18.3
3      2           1  14.3
4      2           2  12

So, basically, you are comparing two groups with two values per group.

So, when you tried to remove the Replication group, you get an error because the output of:

df = Data %>% group_by(Sample %>% summarise(Mean = mean(Days, na.rm = TRUE))

is now:

# A tibble: 2 x 2
  Sample  Mean
   <dbl> <dbl>
1      1  15.1
2      2  13

So, applying anova test on that dataset means that you are comparing two groups with one value each. So, you can't compute residuals and SE.

Instead, you should do it on the full dataset without trying to calculate the mean first:

anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)

And their output are:

> summary(anova_data)
            Df Sum Sq Mean Sq F value Pr(>F)
Sample       1  16.07  16.071   0.713  0.416
Replication  1   9.05   9.054   0.402  0.539
Residuals   11 247.80  22.528               
2 observations deleted due to missingness

> summary(anova_data2)
            Df Sum Sq Mean Sq F value Pr(>F)
Sample       1  16.07   16.07   0.751  0.403
Residuals   12 256.86   21.41               
2 observations deleted due to missingness

Now, you can apply lsmeans:

A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))

> A_d
 Sample lsmean  SE df lower.CL upper.CL
      1   15.3 1.8 11    11.29     19.2
      2   12.9 1.8 11     8.91     16.9

Results are averaged over the levels of: Replication 
Confidence level used: 0.95 

> A_d2
 Sample lsmean   SE df lower.CL upper.CL
      1   15.1 1.75 12    11.33     19.0
      2   13.0 1.75 12     9.19     16.8

Confidence level used: 0.95

It does not change a lot the mean and the SE (which is good because it means that your replicate are consistent and you don't have too much variabilities between those) but it reduces the confidence interval.

So, to plot it, you can:

library(ggplot2)
ggplot(A_d, aes(x=as.factor(Sample), y=lsmean)) + 
  geom_bar(stat="identity", colour="black") +
  geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)

Based on your initial question, if you want to check that the output of ANOVA is correct, you can mimick fake data like this:

d2 <- data.frame(Sample = c(rep(1,10), rep(2,10)),
                 Days = c(rnorm(10, mean =3), rnorm(10, mean = 8)))

Then,

curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)

And you get the following output:

> summary(curve.d2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Sample       1 139.32  139.32   167.7 1.47e-10 ***
Residuals   18  14.96    0.83                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> ANOVA2
 Sample lsmean    SE df lower.CL upper.CL
      1   2.62 0.288 18     2.02     3.23
      2   7.90 0.288 18     7.29     8.51

Confidence level used: 0.95

And for the plot

ggplot(ANOVA2, aes(x=as.factor(Sample), y=lsmean)) + 
    geom_bar(stat="identity", colour="black") +
    geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)

As you can see, we get lsmeans for d2 close to 3 and 8 what we set at the first place. So, I think your output are correct. Maybe your data do not present any significant differences and the computation of SE are the same because the distribution of your data are the same. It is what it is.

I hope this answer helps you.

Data

df = data.frame(Sample = c(rep(1,4), rep(2,4),rep(1,4), rep(2,4)),
                Replication = c(rep(1,8), rep(2,8)),
                Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))

Thank you! As per ANOVA, there are significant differences among Samples (p value <2e-16 ***). I think SE are just quite similar throughout the data. — Jessica, Dec 10 '19 at 19:34
I agree. Just weird but so far, nothing seems to be wrong in your workflow. So, if something is wrong it's coming from the data in first place not from your analysis of it in R. Hope it helps you to get what you want. — dc37, Dec 10 '19 at 20:32
I wanted to know that if I have the same dataset without Rep. For example, the above dataset with eight values for each Sample. Then I calculated means for each Sample across all values using, Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE)). Then I ran ANOVA, aov(Mean~Sample, data=df). But there are no residuals calculated. ANOVA only shows results for Sample, no residuals. So I am not able to calculate lsmeans and SE. — Jessica, Dec 10 '19 at 20:53
I think I figure it out your issue. I edited my answer. Let me know if it looks satisfying to you. — dc37, Dec 10 '19 at 22:07
Okay, I tried your suggestion. It worked for second dataset with anova_data2. For the first dataset, I tried both ways. First method, calculating means, run anova, calculate lsmean & SE. Here ANOVA shows no Rep effect (p value 0.0915), significant Sample effect (p value <2e-16 ***) and residuals df is 54. Second method, run anova, find lsmeans. Here ANOVA shows significant Rep effect (p value 0.0428 * ), significant Sample effect (p value <2e-16 ***) and residuals df is 372. There are different values for df, lsmeans and SE for both ways. What do you think about this difference. — Jessica, Dec 10 '19 at 23:00
It's hard to interpret as it is your experiment, so you should know better what these data corresponds to. The fact that there is an effect in Rep could be a little bit intriguing and need to be a little bit more investigation. My personal opinion is that I will use the anova on the full dataset not on the mean of each group especially as you have a small number of values per group, so you can't assume that the mean is representative of the distribution of the population — dc37, Dec 11 '19 at 03:50
Thank you! I feel if I use anova on the entire dataset not just the means then I will be considering pseudoreplication and it increases the residual df to 372 instead of 54 if I use anova on the means. So I think using anova on the means will be suitable for my dataset. — Jessica, Dec 11 '19 at 15:23
It's your call, as I said you know better what represent these data for your project. Maybe you can find more thoughts on this subject on [CrossValidated](https://stats.stackexchange.com/) community. I hope my answer still help you to figure it out your problem — dc37, Dec 11 '19 at 15:27

Wrong degrees of freedom in lsmeans and SE calculation in R

1 Answers1