ddply() is not giving the correct sd and se r

Question

I have a large dataset for which I want to compute the mean, SD and SE for each combination of two variables (sample and protein). Here is a subset of my data:

   sample    value protein
1 Stage 1 84796453   Tdrd6
2 Stage 1 85665703   Tdrd6

When I use

ddply(df, .(sample, protein), summarise, Mean = mean(value), SE = sd(value) / sqrt((length(value))), SD = sd(value))

I get

   sample protein     Mean       SE       SD
1 Stage 1   Tdrd6 85231078 434624.5 614651.9

The mean is correct. However, since I have only two values, I expected the SD to be 434625 (the difference between the mean and either of the values, which the output reports as SE) and, as calculated with Excel, the SE to be 307326 (which is roughly half of the SD value given in the output). Does anyone know what is going on?

Thanks!

Please provide: https://stackoverflow.com/help/minimal-reproducible-example — deschen, Mar 02 '22 at 12:14

jdobres · Answer 1 · 2022-03-02T12:43:46.207

R's var and sd functions use a denominator of n - 1. From the var docs:

The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.

This is also why R's implementation of these functions will return NA for vectors of length 1. Your Excel calculations seem to be using an uncorrected denominator of n, hence the difference.

The bias correction is considered standard, especially for small samples. We can see the difference if we write a variance function that uses the biased denominator:

var_uncorrected <- function(x, na.rm = FALSE) {
  # Mean squared deviation from the mean: denominator n rather than n - 1
  mean((x - mean(x, na.rm = na.rm))^2, na.rm = na.rm)
}

vals <- c(84796453, 85665703)

sd(vals)
[1] 614652.6

sqrt(var_uncorrected(vals))
[1] 434625
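
To connect this to the numbers in the question, here is a small base R sketch that computes the SD and SE under both denominators. The helper name sd_population is made up for illustration (it is not a base R function), and it assumes a numeric vector with no missing values.

```r
# Illustrative helper (hypothetical name, not part of base R):
# standard deviation with denominator n instead of n - 1.
sd_population <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

vals <- c(84796453, 85665703)
n <- length(vals)

sd_sample <- sd(vals)              # denominator n - 1 (R's default)
sd_pop    <- sd_population(vals)   # denominator n

se_sample <- sd_sample / sqrt(n)   # SE as computed in the ddply() call
se_pop    <- sd_pop / sqrt(n)      # SE built on the uncorrected SD

round(c(sd_sample = sd_sample, se_sample = se_sample,
        sd_pop = sd_pop, se_pop = se_pop), 1)
```

With exactly two values, sd_pop and se_sample are both half the distance between the two values, which would explain why the SE column in the question looked like the expected SD. se_pop comes out at roughly 307326, the SE the asker obtained in Excel.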

Lastly, the plyr library was retired several years ago, and has been superseded by dplyr.
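
As a rough sketch of what the same summary might look like in dplyr, assuming the column names from the question (sample, value, protein) and using only the two rows shown there:

```r
library(dplyr)

# Two rows copied from the question; the real dataset has more groups.
df <- data.frame(
  sample  = c("Stage 1", "Stage 1"),
  value   = c(84796453, 85665703),
  protein = c("Tdrd6", "Tdrd6")
)

result <- df %>%
  group_by(sample, protein) %>%
  summarise(
    Mean = mean(value),
    SD   = sd(value),               # sample SD, denominator n - 1
    SE   = sd(value) / sqrt(n()),   # standard error of the mean
    .groups = "drop"                # return an ungrouped data frame
  )

result
```

This still uses the n - 1 denominator, so the SD and SE agree with the ddply() output rather than with the Excel figures.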

Merijn van Tilborg · Answer 2 · 2022-03-02T12:46:02.940

sd() calculates the sample standard deviation (denominator n - 1), and its result here is correct. It seems you want the population standard deviation instead (assuming your n values are the whole population rather than a sample from it), which can be derived from the sample value.

x = c(84796453, 85665703)
n = length(x)

sd(x) # standard deviation of a sample (denominator n - 1)
# [1] 614653

sqrt((n-1)/n) * sd(x) # standard deviation of a population (denominator n)
# [1] 434625

edited Mar 02 '22 at 12:46

answered Mar 02 '22 at 12:39
