
I am curious to know what algorithm R's mean function uses. Is there some reference to the numerical properties of this algorithm?

I found the following C code in summary.c:do_summary():

case REALSXP:
    PROTECT(ans = allocVector(REALSXP, 1));
    for (i = 0; i < n; i++) s += REAL(x)[i];
    s /= n;
    if (R_FINITE((double)s)) {
        for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
        s += t/n;
    }
    REAL(ans)[0] = s;
    break;

It seems to do a straight up mean:

for (i = 0; i < n; i++) s += REAL(x)[i];
s /= n;

Then it adds what I assume is a numerical correction, which seems to be the mean difference of the data from the estimated mean:

for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
s += t/n;

I haven't been able to track this algorithm down anywhere (mean is not a great search term).

Any help would be much appreciated.

Zach
  • This is an aside, but how does 'mean.R' call 'summary.c'? I do not understand how '.Internal(mean(x))' calls 'summary.c'. Thanks for any pointers on how the two files are connected. Sorry if this is too far removed from your question. I am just hoping to learn. – Mark Miller Jul 25 '13 at 20:11
  • @MarkMiller: All the `.Internal` calls are mapped in `src/main/names.c`. Search that file for "mean" and you'll see what C function calls it. Then you can search the source files for that C function. See [this question](http://stackoverflow.com/q/13279256/271616). – Joshua Ulrich Jul 25 '13 at 20:12
  • Link to this same question on r-devel: https://stat.ethz.ch/pipermail/r-devel/2013-July/067053.html – Brian Diggs Jul 26 '13 at 17:23

2 Answers


I'm not sure what algorithm this is, but Martin Maechler mentioned the updating method of West, 1979 in response to PR#1228, which was implemented by Brian Ripley in R-2.3.0. I couldn't find a reference in the source code or version control logs that listed the actual algorithm used. It was implemented in cov.c in revision 37389 and in summary.c in revision 37393.

Joshua Ulrich
  • Thanks for pointing me in the right direction, I'll have to track down a copy of this paper. – Zach Jul 25 '13 at 20:18
  • I don't think that is West's method. I just downloaded the paper and West proposes a one-pass method for computing the variance - R uses a two-pass method for computing the mean. – hadley Jul 26 '13 at 13:22
  • We're talking about [this revision](https://github.com/wch/r-source/commit/54ddc5dc90c795e08c81607c80ed545aa92a8c5a), right? – hadley Jul 26 '13 at 13:27
  • @hadley: yes, that's the revision. Thanks for checking West's algorithm. – Joshua Ulrich Jul 26 '13 at 14:22

I believe the R algorithm works as follows.

The first, standard calculation of the mean is only an estimate of the algebraic mean, due to floating-point error (which gets worse the further the accumulating sum drifts from the magnitude of the elements being added).

The second pass sums the differences of the elements from the estimated mean. There should be no net difference, since the values on either side of the mean should balance out, but floating-point error remains. The differences from the mean are still subject to error, but these errors should be smaller than the worst-case difference between an element and the accumulating sum (at least the estimated mean lies somewhere within the range of the values, while the running sum may escape it). Dividing by N gives the average difference from the mean, which is then used to nudge the initial estimate closer to the true mean. You could repeat this to get closer and closer, but at some point the floating-point error in calculating the average difference from the mean will defeat you. I guess one pass is close enough.

This was explained to me by my wife.

I'm not sure what the source of the algorithm is, and I'm not sure how it compares to other methods, such as Kahan summation. I guess I'll have to do some testing.

Zach