10

If we compute the mean naively:

#include <numeric>
#include <vector>

std::vector<double> values;   // filled elsewhere
double sum = std::accumulate(begin(values), end(values), 0.0);
double mean = sum / values.size();

and values.size() is big, we can get inaccurate results, since floating-point numbers have less absolute resolution at larger magnitudes. Or worse, if I understand it correctly, the running sum can overflow to an infinite result.

When we have an even number of values, we can compute the mean of the first half, then the mean of the second half, and then take the mean of those two means.
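For example, a minimal sketch of that splitting idea (illustrative only; it assumes the includes from the snippet above and an even, non-empty values):

// Illustrative only: mean of the first half, mean of the second half,
// then the mean of those two means. Assumes values.size() is even and > 0.
double mean_of_halves(const std::vector<double>& values)
{
    const auto half = values.size() / 2;
    const double mean1 = std::accumulate(values.begin(), values.begin() + half, 0.0) / half;
    const double mean2 = std::accumulate(values.begin() + half, values.end(), 0.0) / half;
    return (mean1 + mean2) / 2.0;
}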

This doesn't seem to be a new problem, but I have trouble finding resources. I think there are more sophisticated techniques with trade-offs in

  • robustness
  • computational complexity
  • difficulty to implement

and I wonder if someone has summarized them somewhere or even better if they are available in some library.

Martin Drozdik
  • You could just sort the vector in ascending order of magnitude and then sum it - that way you accumulate all the smallest values first, to minimise rounding errors. – Paul R May 22 '14 at 17:16
  • @PaulR: sorting is `O(n*log(n))` while averaging is `O(n)`. – sds May 22 '14 at 17:19
  • This might help: http://invisibleblocks.com/2008/07/30/long-running-averages-without-the-sum-of-preceding-values/ – bytefire May 22 '14 at 17:22
  • @sds: Nobody is forcing you to use a comparison-based sort here. – tmyklebu May 22 '14 at 17:24

5 Answers

9

You can use an online algorithm as described here.

Basically (in pythonish pseudo-code):

n = 0
mean = 0

for value in data:
    n += 1
    mean += (value - mean)/n

This algorithm is more numerically stable than the naïve implementation.
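
A minimal C++ version of the same running-mean update, for anyone who wants to drop it into the original code (the name running_mean is mine, not from the linked page):

#include <vector>

// Incremental (online) mean: each element adjusts the running mean by a
// fraction of its difference, so the accumulator stays near the data's own
// magnitude instead of growing like the raw sum.
double running_mean(const std::vector<double>& data)
{
    double mean = 0.0;
    double n = 0.0;                    // double avoids int-to-float conversions, as suggested in the comments
    for (double value : data) {
        n += 1.0;
        mean += (value - mean) / n;    // same update as the pseudo-code above
    }
    return mean;
}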

Juan Lopes
  • Can you justify why this is more stable than the naive implementation? Seems like you'd have to assume that the input data are nicely distributed for that. – tmyklebu May 22 '14 at 17:23
  • @LieRyan: Use a `double` for `n`. This also saves you from the int-to-float conversions. – tmyklebu May 22 '14 at 17:27
  • @LieRyan I don't see why it "will almost certainly ... when `values.size()` is big", `unsigned long long n` is enough for **18,446,744,073,709,551,614** (or about 1.84 × 10^19) elements, in which case you probably have other problems already. – AliciaBytes May 22 '14 at 18:24
  • @LieRyan: That's patently false, and for some reason my last comment got deleted. – tmyklebu May 22 '14 at 18:54
  • This can overflow. Suppose `data={x, -x}`. If `x` is large enough, the second iteration of your loop will overflow when computing `value - mean`. Another way this can go wrong is when `data` has many elements. After a while dividing by `n` will mean a huge loss of precision. – a06e Aug 04 '15 at 21:02
9

Lots of stupid things can happen here. One problem is the overflow thing. Another is exemplified by this: ((1e100 + 1) - 1e100) == 0. Another is just accumulated roundoff.

Kahan summation handles accumulated roundoff very well for well-scaled data. Find the sum using Kahan summation, then divide by the number of data points.
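
A minimal sketch of that approach (the function name kahan_mean is mine; it assumes data is non-empty):

#include <vector>

// Kahan (compensated) summation: "c" carries the low-order bits that would
// otherwise be lost when adding a small term to a large running sum.
// Note: aggressive flags such as -ffast-math can optimise the compensation away.
double kahan_mean(const std::vector<double>& data)
{
    double sum = 0.0;
    double c = 0.0;                // running compensation for lost low-order bits
    for (double x : data) {
        double y = x - c;          // apply the correction to the next term
        double t = sum + y;        // big + small: low-order bits of y are lost here...
        c = (t - sum) - y;         // ...and recovered algebraically into c
        sum = t;
    }
    return sum / static_cast<double>(data.size());
}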

To deal with poorly-scaled data, you might bucket the data by exponent (say 50 different buckets each covering about 20 different exponents) and Kahan-sum in descending bucket order.
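
One rough sketch of that bucketing idea (the bucket width and count below are illustrative choices of mine, not prescriptive):

#include <array>
#include <cmath>
#include <vector>

// Group values by binary exponent, Kahan-sum each bucket as the data streams
// in, then combine the bucket sums from largest magnitude down to smallest,
// again with compensation.
double bucketed_mean(const std::vector<double>& data)
{
    constexpr int kBuckets = 110;            // covers the full double exponent range at ~20 exponents per bucket
    std::array<double, kBuckets> sum{}, comp{};

    for (double x : data) {
        int exp = 0;
        std::frexp(x, &exp);                 // binary exponent of x; 0 for x == 0
        int b = (exp + 1100) / 20;           // shift so the bucket index is non-negative
        if (b < 0) b = 0;
        if (b >= kBuckets) b = kBuckets - 1;
        double y = x - comp[b];              // Kahan step within the bucket
        double t = sum[b] + y;
        comp[b] = (t - sum[b]) - y;
        sum[b] = t;
    }

    // Combine buckets in descending order of magnitude, with compensation.
    double total = 0.0, c = 0.0;
    for (int b = kBuckets - 1; b >= 0; --b) {
        double y = sum[b] - c;
        double t = total + y;
        c = (t - total) - y;
        total = t;
    }
    return total / static_cast<double>(data.size());
}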

This is all massive overkill, of course, and it's rather slow. In practice, using vector instructions and stuff like that helps with speed and with precision.

tmyklebu
4

If you're willing to muck about with the values in the process, a simple and robust scheme is to first sort them by magnitude:

#include <algorithm>
#include <cmath>
#include <numeric>

// Comparator that orders values by absolute magnitude.
struct fabs_less
{
    bool operator()(const double x0, const double x1) const
    {
        return std::fabs(x0) < std::fabs(x1);
    }
};

std::sort(values.begin(), values.end(), fabs_less());
const double sum = std::accumulate(values.begin(), values.end(), 0.0);
const double mean = sum / double(values.size());

This increases the computational complexity to N log N but results in the minimum possible rounding error.

Edit: tmyklebu makes a very good point with a degenerate case (curses that I missed it). Instead, accumulate the negative and positive terms separately, each in order of increasing magnitude:

std::sort(values.begin(), values.end());
// First element strictly greater than zero; everything before it is <= 0.
std::vector<double>::const_iterator mid = std::upper_bound(values.cbegin(), values.cend(), 0.0);
std::vector<double>::const_reverse_iterator rmid(mid);
const double neg = std::accumulate(rmid, values.crend(), 0.0);
const double pos = std::accumulate(mid, values.cend(), 0.0);
const double mean = (neg + pos) / double(values.size());

This introduces the possibility of cancellation error in neg+pos, but the result will still have a small error relative to the sum of the absolute values of the elements of values, which I think is the best you can hope for without some seriously complicated logic...

thus spake a.k.
  • I don't believe it does. Try adding up a bunch of `1`s and a bunch of `-1 - 0x1p-52`'s with this scheme. – tmyklebu May 22 '14 at 21:19
2

Generally, a divide and conquer technique (a recursive split in two parts) is robust.

See my answer to Precise sum of floating point numbers where I demonstrate it with a recursive form.

Note that there is no recursive tail call elimination in C/C++, so this implementation is not necessarily efficient (it leads to a deep stack).
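
For illustration, a minimal pairwise-summation sketch (my own, not the code from the linked answer):

#include <cstddef>
#include <numeric>
#include <vector>

// Pairwise (divide-and-conquer) summation: split the range in two, sum each
// half recursively, and add the partial sums. The rounding error grows like
// O(log n) rather than O(n) for the naive left-to-right sum.
double pairwise_sum(const double* data, std::size_t n)
{
    if (n <= 32)                               // small base case: sum directly
        return std::accumulate(data, data + n, 0.0);
    const std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}

double pairwise_mean(const std::vector<double>& values)
{
    return pairwise_sum(values.data(), values.size()) / double(values.size());
}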

aka.nice
  • There is most certainly recursive tail call elimination in C and C++, though your `sumpairs` does not fully qualify for recursive tail call elimination since it calls itself twice. – Mooing Duck May 22 '14 at 21:45
1

Pardon, not doing this as a comment due to length. A double usually has more than 50 bits of precision. You're talking about 1 part in a trillion or more.

The resolution of a floating-point number stays the same in relative (fractional) terms throughout its range.

But, if you add 1234E40 to 1234E-40, you're going to get 1234E40. Adding values of very different orders of magnitude will throw an average off. However, the amount it is off by is usually so small (trillionths) that it is rarely noticeable.
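
A tiny self-contained illustration of that absorption effect (numbers chosen only for the example):

#include <cassert>

int main()
{
    const double big   = 1234e40;
    const double small = 1234e-40;
    // The small term is far below the resolution of the big one, so it is
    // absorbed completely: the sum is bit-for-bit equal to the big value.
    assert(big + small == big);
    return 0;
}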

In nearly all cases, you can do an average simply by adding and dividing by the count and get a very precise answer.

You might even be able to use a long double on your system.

If you have some data set where this is not the case, maybe you can describe that data set and the problems it presents. From that, we could come up with a solution to your particular problem.

user3344003
  • The problem is that you may have so many small numbers that their sum becomes significant in relation to the larger numbers. (If you're off by a trillionth a quintillion times, it may well be significant.) – molbdnilo May 22 '14 at 21:15
  • That does not happen in most circumstances. If the OP stated that he had data of that nature, you could incur the overhead of adjusting. In most measurements, one does not encounter wide variations that cause that problem. – user3344003 May 23 '14 at 13:37