Why am I getting different means using different functions in R?

Question

I have data showing beta values (neural activity) as a continuous dv and a categorical iv (Position) with several levels (1, 2, 3). I have created a summary dataframe that includes the means for each condition.

    df_sum <- df %>%
  group_by(Position) %>%
  summarise( 
    n=n(),
    mean=mean(Beta),
    sd=sd(Beta)
  ) %>%
  mutate( se=sd/sqrt(n))  %>%
  mutate( ic=se * qt((1-0.05)/2 + .5, n-1))

This is a reanalysis of some old data originally analysed in SPSS, and the means from the summarise function agree with those previously reported. However, when I run t-tests with either emmeans

emmeans((lm(Beta ~ Position, data = df)), pairwise ~ Position)

or t.test

t.test(df$Beta[df$Position=="1"], df$Beta[df$Position=="2"])

the output reports means that are the same as each other but slightly different to the ones calculated by summarise and reported by the original study. Why is this happening (or really, how??), and which should I use?

Hi @danielb, could you post your data? It's difficult for us to help without it — Tob, Jun 13 '22 at 14:03
Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. — r2evans, Jun 13 '22 at 14:51

Russ Lenth · Accepted Answer · 2022-06-14T18:11:53.327

I suspect that you are seeing means that are identical, but the confidence intervals for them differ. And that's because the emmeans results are based on the model fitted by lm, which estimates a common error SD, with more degrees of freedom. Either set of results is fine, but they are based in different models. If he SDs don't differ statistically, you are better off with the lm results because they make more efficient use of the data.

You can reproduce the SEs provided by emmeans as follows:

df_sum %>% 
    mutate(sp = sqrt(sum((n-1)*sd^2) / sum(n-1))) %>%
    group_by(Position) %>%
    mutate(se.p = sp/sqrt(n))

In these steps, we

compute the pooled SD, sp via a weighted average of the variances
Divide into groups
Compute se.p, the SE based on the pooled s

Why am I getting different means using different functions in R?

1 Answers1