26

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense: it should have the smallest rounding error, should not be sensitive to overflow or underflow, and so on.

Thank you.


Additional information: incremental approaches are preferred, since the values may not all fit into RAM (several parallel calculations on files larger than 4 GB).

Tobias Langner
  • Whoever voted to close as not constructive got it wrong big time. This is an excellent and appropriate question. – David Heffernan Sep 26 '11 at 08:54
  • Note that the different algorithms presented aren't mutually exclusive. It's perfectly feasible to read 1 MB chunks, sort them, sum them, and then use Kahan summation over all the partial sums. – MSalters Sep 26 '11 at 10:35
  • Thank you for all of your comments. They helped me understand my problem. I'll accept the paper as the answer, as it provides an analysis of different ways to handle the sum. – Tobias Langner Sep 28 '11 at 07:21

6 Answers

13

If you want an O(N) algorithm, look at Kahan summation.
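For illustration, here is a minimal sketch of Kahan summation applied to the mean; the function name `kahan_mean` is mine, not from any library:

```c
#include <assert.h>
#include <stddef.h>

/* Kahan (compensated) summation: a second variable carries the
   low-order bits that are lost when each term is added to the
   running sum. kahan_mean is an illustrative name. */
double kahan_mean(const double *data, size_t n)
{
    double sum = 0.0; /* running sum */
    double c   = 0.0; /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = data[i] - c; /* correct the next term */
        double t = sum + y;     /* low-order bits of y may be lost here */
        c = (t - sum) - y;      /* recover what was just lost */
        sum = t;
    }
    return n ? sum / (double)n : 0.0;
}
```

Be aware that aggressive floating-point optimizations (e.g. `-ffast-math`) can legally reassociate the arithmetic and remove the compensation step, so compile with care.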

MSalters
  • Clearly O(N) is going to be faster than the sorting approach. Do you know whether or not Kahan is more or less accurate than sorting before summing? – David Heffernan Sep 26 '11 at 08:48
  • @David Heffernan: from the wikipedia entry: "Although Kahan's algorithm achieves O(1) error growth for summing n numbers, only slightly worse O(log n) growth can be achieved by pairwise summation: one recursively divides the set of numbers into two halves, sums each half, and then adds the two sums." I doubt the sorting method can do O(1). – Karoly Horvath Sep 26 '11 at 09:01
  • @yi_H I don't quite follow the logic there. Pairwise summation as described there doesn't seem to involve sorting. No matter, sorting is clearly very expensive compared to either Kahan or pairwise summation. – David Heffernan Sep 26 '11 at 09:06
  • @David: depends on the exact summing algorithm you're using after sorting. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7857&rep=rep1&type=pdf for more details. 100% accuracy is possible (i.e. the only error is due to the finite amount of storage for the result) – MSalters Sep 26 '11 at 10:46
10

You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal of Scientific Computation, 1993).

If I remember it correctly, compensated summation (Kahan summation) is good if all numbers are positive, at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.

Jitse Niesen
  • There's always the cheap trick of summing positive and negative numbers separately. In this case, the speed of the algorithm isn't that critical as long as it's O(N); the disk I/O will dominate almost any number of FP ops. – MSalters Sep 26 '11 at 11:01
  • @MSalters Why would you want to sum them separately? If you want to minimize round-off error, then the intermediate results should be as small as possible (in absolute value). Summing them separately has the opposite effect. – Jitse Niesen Sep 28 '11 at 08:33
  • As you note yourself, compensated summation is good if all numbers have the same sign. – MSalters Sep 29 '11 at 09:19
  • Compensated summation is still good with positive and negative terms. Its error is cond(S_n)u, where cond(S_n) is the condition number of summation and u is the unit roundoff. Regular summation has error ~ncond(S_n)u. If the sum is ill-conditioned none of the suggestions help. – user14717 May 07 '20 at 22:51
  • On second thought, that idea about separate summation of positive and negative parts still has issues. You get two accurate results, but can still get catastrophic cancellation in the final results when those sub-sums are approximately equal. – MSalters Mar 08 '21 at 08:52
  • And on further thought, if you separate the positive and negative terms, you get two compensation terms from Kahan summation. You shouldn't apply the compensation terms before adding the positive and negative sums together, but afterwards. – MSalters Feb 08 '22 at 11:04
5

I always use the following pseudocode:

float mean=0.0; // could use double
int n=0;  // could use long

for each x in data:
    ++n;
    mean+=(x-mean)/n;

I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming that the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.
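A runnable C version of the pseudocode above might look like this (`running_mean` is just an illustrative name):

```c
#include <assert.h>
#include <stddef.h>

/* Incremental (running) mean: mean_n = mean_{n-1} + (x - mean_{n-1})/n.
   The intermediate value stays on the order of the data itself, so no
   huge running sum can overflow. */
double running_mean(const double *data, size_t size)
{
    double mean = 0.0;
    size_t n = 0;
    for (size_t i = 0; i < size; i++) {
        ++n;
        mean += (data[i] - mean) / (double)n;
    }
    return mean;
}
```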

Dave
  • Note that for `float`, this will be less accurate the closer *n* gets to *2^23*. For example a sequence of 10 million values `1.0` followed by 10 million values `2.0` would average to `1.269`. This is because `(x-mean)/n` gets closer to zero, and `mean` does not change when added. – jpa Mar 06 '21 at 15:23
  • @jpa if you're worried about that, you can always shuffle them first. – Dave Mar 06 '21 at 22:48
  • +1 as this is almost as simple as a naive implementation, without the accumulation buildup. Which Knuth book, volume 2? – Justin Meiners Apr 14 '21 at 20:59
  • This blogpost provides reasoning for this method. https://diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59 – unique2 Aug 18 '21 at 17:45
5

Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
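A sketch of this approach, assuming the data fits in memory (it sorts in place; `sorted_mean` and `cmp_magnitude` are illustrative names):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort: ascending order of magnitude. */
static int cmp_magnitude(const void *pa, const void *pb)
{
    double a = fabs(*(const double *)pa);
    double b = fabs(*(const double *)pb);
    return (a > b) - (a < b);
}

/* Sort by |x| (in place), sum smallest magnitude first, divide by count. */
double sorted_mean(double *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_magnitude);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return n ? sum / (double)n : 0.0;
}
```

The O(N log N) sort dominates the cost, which is why the comments compare it against the O(N) alternatives above.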

David Heffernan
3

A very late post, but since I don't have enough reputation to comment: @Dave's method is the one used (as of December 2020) by the GNU Scientific Library.

Here is the code, extracted from mean_source.c:

double FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1))/(n+1) */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}

GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.

freeB
3

Just to add one possible answer for further discussion:

Incrementally calculate the average for each step:

AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n

or pairwise combination

AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)

(I hope the formulas are clear enough)
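For the use case in the question (several parallel calculations over large files), the pairwise form lets each worker compute a per-chunk average and count, which are then merged; a sketch, with `merge_avg` as an illustrative name:

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise combination of two partial averages:
   AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b) */
double merge_avg(double avg_a, size_t n_a, double avg_b, size_t n_b)
{
    return ((double)n_a * avg_a + (double)n_b * avg_b)
           / (double)(n_a + n_b);
}
```

Note that `n_a * avg_a` reconstructs the chunk's full sum, so for extreme values this form can reintroduce the overflow that the incremental form avoids.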

Tobias Langner
  • Admittedly, I have not worked it out myself, but the repeated divisions of the incremental form seem like they would generate more accuracy loss. Part of the issue is that `1/n` introduces errors in the least significant bits, so `n/n != 1`, at least when it is performed as a three-step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data. – rcollyer Oct 06 '11 at 16:39
  • With this formula you don't run the risk of an overflow of the datatype. – Tobias Langner Aug 08 '12 at 10:03
  • Awesome formulas. Here's my implementation of the first one in C#: `double average(IList numbers, int avgUpTo1BasedIndex) { return (avgUpTo1BasedIndex == 1) ? numbers[0] : average(numbers, avgUpTo1BasedIndex - 1) * (avgUpTo1BasedIndex - 1) / avgUpTo1BasedIndex + (double)numbers[avgUpTo1BasedIndex - 1] / avgUpTo1BasedIndex; }` – ShawnFeatherly Oct 24 '14 at 20:00