10

If we compute the mean naively:

#include <numeric>
#include <vector>

std::vector<double> values;   // filled elsewhere
double sum = std::accumulate(begin(values), end(values), 0.0);
double mean = sum / values.size();

and values.size() is big, we can get inaccurate results, since floating-point numbers have less absolute resolution at larger magnitudes. Or worse, if I understand it correctly, the running sum can overflow to an infinite result.

When we have an even number of values, we can compute the mean of the first half, then the mean of the second half, and then take the mean of those two means.
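For example, a minimal sketch of that splitting idea (illustrative only; it assumes the includes from the snippet above and an even, non-empty values):

// Illustrative only: mean of the first half, mean of the second half,
// then the mean of those two means. Assumes values.size() is even and > 0.
double mean_of_halves(const std::vector<double>& values)
{
    const auto half = values.size() / 2;
    const double mean1 = std::accumulate(values.begin(), values.begin() + half, 0.0) / half;
    const double mean2 = std::accumulate(values.begin() + half, values.end(), 0.0) / half;
    return (mean1 + mean2) / 2.0;
}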

This doesn't seem to be a new problem, but I have trouble finding resources. I think there are more sophisticated techniques with trade-offs in

  • robustness
  • computational complexity
  • difficulty to implement

and I wonder if someone has summarized them somewhere or even better if they are available in some library.

Martin Drozdik
  • You could just sort the vector in ascending order of magnitude and then sum it - that way you accumulate all the smallest values first, to minimise rounding errors. – Paul R May 22 '14 at 17:16
  • @PaulR: sorting is `O(n*log(n))` while averaging is `O(n)`. – sds May 22 '14 at 17:19
  • This might help: http://invisibleblocks.com/2008/07/30/long-running-averages-without-the-sum-of-preceding-values/ – bytefire May 22 '14 at 17:22
  • @sds: Nobody is forcing you to use a comparison-based sort here. – tmyklebu May 22 '14 at 17:24

5 Answers

9

You can use an online algorithm as described here.

Basically (in pythonish pseudo-code):

n = 0
mean = 0

for value in data:
    n += 1
    mean += (value - mean)/n

This algorithm is more numerically stable than the naïve implementation.
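
A minimal C++ version of the same running-mean update, for anyone who wants to drop it into the original code (the name running_mean is mine, not from the linked page):

#include <vector>

// Incremental (online) mean: each element adjusts the running mean by a
// fraction of its difference, so the accumulator stays near the data's own
// magnitude instead of growing like the raw sum.
double running_mean(const std::vector<double>& data)
{
    double mean = 0.0;
    double n = 0.0;                    // double avoids int-to-float conversions, as suggested in the comments
    for (double value : data) {
        n += 1.0;
        mean += (value - mean) / n;    // same update as the pseudo-code above
    }
    return mean;
}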

Juan Lopes
  • Can you justify why this is more stable than the naive implementation? Seems like you'd have to assume that the input data are nicely distributed for that. – tmyklebu May 22 '14 at 17:23
  • @LieRyan: Use a `double` for `n`. This also saves you from the int-to-float conversions. – tmyklebu May 22 '14 at 17:27
  • @LieRyan I don't see why it "will almost certainly ... when `values.size()` is big", `unsigned long long n` is enough for **18,446,744,073,709,551,614** (or about 1.84 × 10^19) elements, in which case you probably have other problems already. – AliciaBytes May 22 '14 at 18:24
  • @LieRyan: That's patently false, and for some reason my last comment got deleted. – tmyklebu May 22 '14 at 18:54
  • This can overflow. Suppose `data={x, -x}`. If `x` is large enough, the second iteration of your loop will overflow when computing `value - mean`. Another way this can go wrong is when `data` has many elements. After a while dividing by `n` will mean a huge loss of precision. – a06e Aug 04 '15 at 21:02
9

Lots of stupid things can happen here. One problem is the overflow thing. Another is exemplified by this: ((1e100 + 1) - 1e100) == 0. Another is just accumulated roundoff.

Kahan summation handles accumulated roundoff very well for well-scaled data. Find the sum using Kahan summation, then divide by the number of data points.
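
A minimal sketch of that approach (the function name kahan_mean is mine; it assumes data is non-empty):

#include <vector>

// Kahan (compensated) summation: "c" carries the low-order bits that would
// otherwise be lost when adding a small term to a large running sum.
// Note: aggressive flags such as -ffast-math can optimise the compensation away.
double kahan_mean(const std::vector<double>& data)
{
    double sum = 0.0;
    double c = 0.0;                // running compensation for lost low-order bits
    for (double x : data) {
        double y = x - c;          // apply the correction to the next term
        double t = sum + y;        // big + small: low-order bits of y are lost here...
        c = (t - sum) - y;         // ...and recovered algebraically into c
        sum = t;
    }
    return sum / static_cast<double>(data.size());
}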

To deal with poorly-scaled data, you might bucket the data by exponent (say 50 different buckets each covering about 20 different exponents) and Kahan-sum in descending bucket order.
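
One rough sketch of that bucketing idea (the bucket width and count below are illustrative choices of mine, not prescriptive):

#include <array>
#include <cmath>
#include <vector>

// Group values by binary exponent, Kahan-sum each bucket as the data streams
// in, then combine the bucket sums from largest magnitude down to smallest,
// again with compensation.
double bucketed_mean(const std::vector<double>& data)
{
    constexpr int kBuckets = 110;            // covers the full double exponent range at ~20 exponents per bucket
    std::array<double, kBuckets> sum{}, comp{};

    for (double x : data) {
        int exp = 0;
        std::frexp(x, &exp);                 // binary exponent of x; 0 for x == 0
        int b = (exp + 1100) / 20;           // shift so the bucket index is non-negative
        if (b < 0) b = 0;
        if (b >= kBuckets) b = kBuckets - 1;
        double y = x - comp[b];              // Kahan step within the bucket
        double t = sum[b] + y;
        comp[b] = (t - sum[b]) - y;
        sum[b] = t;
    }

    // Combine buckets in descending order of magnitude, with compensation.
    double total = 0.0, c = 0.0;
    for (int b = kBuckets - 1; b >= 0; --b) {
        double y = sum[b] - c;
        double t = total + y;
        c = (t - total) - y;
        total = t;
    }
    return total / static_cast<double>(data.size());
}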

This is all massive overkill, of course, and it's rather slow. In practice, using vector instructions and stuff like that helps with speed and with precision.

tmyklebu
4

If you're willing to muck about with the values in the process, a simple and robust scheme is to first sort them by magnitude:

#include <algorithm>
#include <cmath>
#include <numeric>

// Comparator that orders values by absolute magnitude.
struct fabs_less
{
    bool operator()(const double x0, const double x1) const
    {
        return std::fabs(x0) < std::fabs(x1);
    }
};

std::sort(values.begin(), values.end(), fabs_less());
const double sum = std::accumulate(values.begin(), values.end(), 0.0);
const double mean = sum / double(values.size());

This increases the computational complexity to N log N but results in the minimum possible rounding error.

Edit: tmyklebu makes a very good point with a degenerate case (curses that I missed it). Instead, accumulate the negative and positive terms separately, each in order of increasing magnitude:

std::sort(values.begin(), values.end());
// First element strictly greater than zero; everything before it is <= 0.
std::vector<double>::const_iterator mid = std::upper_bound(values.cbegin(), values.cend(), 0.0);
std::vector<double>::const_reverse_iterator rmid(mid);
const double neg = std::accumulate(rmid, values.crend(), 0.0);
const double pos = std::accumulate(mid, values.cend(), 0.0);
const double mean = (neg + pos) / double(values.size());

This introduces the possibility of cancellation error in neg+pos, but the result will still have a small error relative to the sum of the absolute values of the elements of values, which I think is the best you can hope for without some seriously complicated logic...

thus spake a.k.
  • I don't believe it does. Try adding up a bunch of `1`s and a bunch of `-1 - 0x1p-52`'s with this scheme. – tmyklebu May 22 '14 at 21:19
2

Generally, a divide and conquer technique (a recursive split in two parts) is robust.

See my answer to Precise sum of floating point numbers where I demonstrate it with a recursive form.

Note that there is no recursive tail call elimination in C/C++, so this implementation is not necessarily efficient (it leads to a deep stack).
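
For illustration, a minimal pairwise-summation sketch (my own, not the code from the linked answer):

#include <cstddef>
#include <numeric>
#include <vector>

// Pairwise (divide-and-conquer) summation: split the range in two, sum each
// half recursively, and add the partial sums. The rounding error grows like
// O(log n) rather than O(n) for the naive left-to-right sum.
double pairwise_sum(const double* data, std::size_t n)
{
    if (n <= 32)                               // small base case: sum directly
        return std::accumulate(data, data + n, 0.0);
    const std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}

double pairwise_mean(const std::vector<double>& values)
{
    return pairwise_sum(values.data(), values.size()) / double(values.size());
}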

aka.nice
  • There is most certainly recursive tail call elimination in C and C++, though your `sumpairs` does not fully qualify for recursive tail call elimination since it calls itself twice. – Mooing Duck May 22 '14 at 21:45
1

Pardon, not doing this as a comment due to length. A double usually has more than 50 bits of precision. You're talking about 1 part in a trillion or more.

The resolution of a floating-point number stays the same in relative (fractional) terms throughout its range.

But, if you add 1234E40 to 1234E-40, you're going to get 1234E40. Adding values of very different orders of magnitude will throw an average off. However, the amount it is off by is usually so small (trillionths) that it is rarely noticeable.
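
A tiny self-contained illustration of that absorption effect (numbers chosen only for the example):

#include <cassert>

int main()
{
    const double big   = 1234e40;
    const double small = 1234e-40;
    // The small term is far below the resolution of the big one, so it is
    // absorbed completely: the sum is bit-for-bit equal to the big value.
    assert(big + small == big);
    return 0;
}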

In nearly all cases, you can do an average simply by adding and dividing by the count and get a very precise answer.

You might even be able to use a long double on your system.

If you have some data set where this is not the case, maybe you can describe that data set and the problems it presents. From that, we could come up with a solution to your particular problem.

user3344003
  • The problem is that you may have so many small numbers that their sum becomes significant in relation to the larger numbers. (If you're off by a trillionth a quintillion times, it may well be significant.) – molbdnilo May 22 '14 at 21:15
  • That does not happen in most circumstances. If the OP stated that he had data of that nature, you could incur the overhead of adjusting. In most measurements, one does not encounter wide variations that cause that problem. – user3344003 May 23 '14 at 13:37