26

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense: it should have the smallest rounding error, should not be sensitive to overflow or underflow, and so on.

Thank you.


Additional information: incremental approaches are preferred, since the values may not all fit into RAM (several parallel calculations on files larger than 4 GB).

Tobias Langner
  • Whoever voted to close as not constructive got it wrong big time. This is an excellent and appropriate question. – David Heffernan Sep 26 '11 at 08:54
  • Note that the different algorithms presented aren't mutually exclusive. It's perfectly feasible to read 1 MB chunks, sort them, sum them, and then use Kahan summation over all the partial sums. – MSalters Sep 26 '11 at 10:35
  • Thank you for all of your comments. They helped me understand my problem. I'll accept the paper as the answer, as it provides an analysis of different ways to handle the sum. – Tobias Langner Sep 28 '11 at 07:21

6 Answers

13

If you want an O(N) algorithm, look at Kahan summation.
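For illustration, here is a minimal sketch of Kahan summation applied to the mean; the function name `kahan_mean` is mine, not from any library:

```c
#include <assert.h>
#include <stddef.h>

/* Kahan (compensated) summation: a second variable carries the
   low-order bits that are lost when each term is added to the
   running sum. kahan_mean is an illustrative name. */
double kahan_mean(const double *data, size_t n)
{
    double sum = 0.0; /* running sum */
    double c   = 0.0; /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = data[i] - c; /* correct the next term */
        double t = sum + y;     /* low-order bits of y may be lost here */
        c = (t - sum) - y;      /* recover what was just lost */
        sum = t;
    }
    return n ? sum / (double)n : 0.0;
}
```

Be aware that aggressive floating-point optimizations (e.g. `-ffast-math`) can legally reassociate the arithmetic and remove the compensation step, so compile with care.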

MSalters
  • Clearly O(N) is going to be faster than the sorting approach. Do you know whether or not Kahan is more or less accurate than sorting before summing? – David Heffernan Sep 26 '11 at 08:48
  • @David Heffernan: from the wikipedia entry: "Although Kahan's algorithm achieves O(1) error growth for summing n numbers, only slightly worse O(log n) growth can be achieved by pairwise summation: one recursively divides the set of numbers into two halves, sums each half, and then adds the two sums." I doubt the sorting method can do O(1). – Karoly Horvath Sep 26 '11 at 09:01
  • @yi_H I don't quite follow the logic there. Pairwise summation as described there doesn't seem to involve sorting. No matter, sorting is clearly very expensive compared to either Kahan or pairwise summation. – David Heffernan Sep 26 '11 at 09:06
  • @David: depends on the exact summing algorithm you're using after sorting. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7857&rep=rep1&type=pdf for more details. 100% accuracy is possible (i.e. the only error is due to the finite amount of storage for the result) – MSalters Sep 26 '11 at 10:46
10

You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal of Scientific Computation, 1993).

If I remember it correctly, compensated summation (Kahan summation) is good if all numbers are positive, at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.

Jitse Niesen
  • There's always the cheap trick of summing positive and negative numbers separately. In this case, the speed of the algorithm isn't that critical as long as it's O(N); the disk I/O will dominate almost any number of FP ops. – MSalters Sep 26 '11 at 11:01
  • @MSalters Why would you want to sum them separately? If you want to minimize round-off error, then the intermediate results should be as small as possible (in absolute value). Summing them separately has the opposite effect. – Jitse Niesen Sep 28 '11 at 08:33
  • As you note yourself, compensated summation is good if all numbers have the same sign. – MSalters Sep 29 '11 at 09:19
  • Compensated summation is still good with positive and negative terms. Its error is cond(S_n)u, where cond(S_n) is the condition number of summation and u is the unit roundoff. Regular summation has error ~ncond(S_n)u. If the sum is ill-conditioned none of the suggestions help. – user14717 May 07 '20 at 22:51
  • On second thought, that idea about separate summation of positive and negative parts still has issues. You get two accurate results, but can still get catastrophic cancellation in the final results when those sub-sums are approximately equal. – MSalters Mar 08 '21 at 08:52
  • And on further thought, if you separate the positive and negative terms, you get two compensation terms from Kahan summation. You shouldn't apply the compensation terms before adding the positive and negative sums together, but afterwards. – MSalters Feb 08 '22 at 11:04
5

I always use the following pseudocode:

float mean=0.0; // could use double
int n=0;  // could use long

for each x in data:
    ++n;
    mean+=(x-mean)/n;

I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming that the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.
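A runnable C version of the pseudocode above might look like this (`running_mean` is just an illustrative name):

```c
#include <assert.h>
#include <stddef.h>

/* Incremental (running) mean: mean_n = mean_{n-1} + (x - mean_{n-1})/n.
   The intermediate value stays on the order of the data itself, so no
   huge running sum can overflow. */
double running_mean(const double *data, size_t size)
{
    double mean = 0.0;
    size_t n = 0;
    for (size_t i = 0; i < size; i++) {
        ++n;
        mean += (data[i] - mean) / (double)n;
    }
    return mean;
}
```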

Dave
  • Note that for `float`, this will be less accurate the closer *n* gets to *2^23*. For example a sequence of 10 million values `1.0` followed by 10 million values `2.0` would average to `1.269`. This is because `(x-mean)/n` gets closer to zero, and `mean` does not change when added. – jpa Mar 06 '21 at 15:23
  • @jpa if you're worried about that, you can always shuffle them first. – Dave Mar 06 '21 at 22:48
  • +1 as this is almost as simple as a naive implementation, without the accumulation buildup. Which Knuth book, volume 2? – Justin Meiners Apr 14 '21 at 20:59
  • This blogpost provides reasoning for this method. https://diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59 – unique2 Aug 18 '21 at 17:45
5

Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
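A sketch of this approach, assuming the data fits in memory (it sorts in place; `sorted_mean` and `cmp_magnitude` are illustrative names):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparator for qsort: ascending order of magnitude. */
static int cmp_magnitude(const void *pa, const void *pb)
{
    double a = fabs(*(const double *)pa);
    double b = fabs(*(const double *)pb);
    return (a > b) - (a < b);
}

/* Sort by |x| (in place), sum smallest magnitude first, divide by count. */
double sorted_mean(double *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_magnitude);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return n ? sum / (double)n : 0.0;
}
```

The O(N log N) sort dominates the cost, which is why the comments compare it against the O(N) alternatives above.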

David Heffernan
3

A very late post, but since I don't have enough reputation to comment: @Dave's method is the one used (as of December 2020) by the GNU Scientific Library.

Here is the code, extracted from mean_source.c:

double FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1))/(n+1) */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}

GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.

freeB
3

Just to add one possible answer for further discussion:

Incrementally calculate the average for each step:

AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n

or pairwise combination

AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)

(I hope the formulas are clear enough)
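For the use case in the question (several parallel calculations over large files), the pairwise form lets each worker compute a per-chunk average and count, which are then merged; a sketch, with `merge_avg` as an illustrative name:

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise combination of two partial averages:
   AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b) */
double merge_avg(double avg_a, size_t n_a, double avg_b, size_t n_b)
{
    return ((double)n_a * avg_a + (double)n_b * avg_b)
           / (double)(n_a + n_b);
}
```

Note that `n_a * avg_a` reconstructs the chunk's full sum, so for extreme values this form can reintroduce the overflow that the incremental form avoids.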

Tobias Langner
  • Admittedly, I have not worked it out myself, but the repeated divisions of the incremental form seem like they would generate more accuracy loss. Part of the issue is that `1/n` introduces errors in the least significant bits, so `n/n != 1`, at least when it is performed as a three-step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data. – rcollyer Oct 06 '11 at 16:39
  • With this formula you don't run the risk of an overflow of the datatype. – Tobias Langner Aug 08 '12 at 10:03
  • Awesome formulas. Here's my implementation of the first one in C#: `double average(IList numbers, int avgUpTo1BasedIndex) { return (avgUpTo1BasedIndex == 1) ? numbers[0] : average(numbers, avgUpTo1BasedIndex - 1) * (avgUpTo1BasedIndex - 1) / avgUpTo1BasedIndex + (double)numbers[avgUpTo1BasedIndex - 1] / avgUpTo1BasedIndex; }` – ShawnFeatherly Oct 24 '14 at 20:00