
Given these two implementations of an average function:

float average(const vector<float>& seq)
{
  float sum = 0.0f;

  for (auto&& value : seq)
  {
    sum += value;
  }

  return sum / seq.size();
}

And:

float average(const vector<float>& seq)
{
  float avg = 0.0f;

  for (auto&& value : seq)
  {
    avg += value / seq.size();
  }

  return avg;
}

To illustrate my question, imagine we have a huge difference in the input data, like so:

1.0f, 0.0f, 0.0f, 0.0f, 1000000.0f

My guess is that in the first implementation, sum can grow "too much" and lose the least significant digits, ending up as 1000000.0f instead of 1000001.0f at the end of the sum loop.

On the other hand, the second implementation seems theoretically less efficient, because of the number of divisions to perform (I haven't profiled anything; this is a blind guess).

So, is one of these implementations preferable to the other? Am I right that the first implementation is less accurate?
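For what it's worth, here is the kind of minimal harness I have in mind for comparing the two on that input. The names average_sum_first and average_divide_each are just my renamings of the two implementations above, so that they can coexist in one translation unit:

#include <iomanip>
#include <iostream>
#include <vector>

// The two implementations above, pasted in under different names.
float average_sum_first(const std::vector<float>& seq);
float average_divide_each(const std::vector<float>& seq);

int main()
{
  std::vector<float> seq = { 1.0f, 0.0f, 0.0f, 0.0f, 1000000.0f };

  std::cout << std::setprecision(9)
            << "sum first:   " << average_sum_first(seq) << '\n'
            << "divide each: " << average_divide_each(seq) << '\n';
}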

ereOn
  • Why don't you divide once in every `n` iterations, instead of in every iteration? You can compute the approximate value of `n` depending on the values in `seq`. Once you compute `n`, you can use it in an outer loop and modify the code accordingly. There could be other variants of this approach (one possible reading is sketched after these comments). – Nawaz May 03 '13 at 08:11
  • @Nawaz: I'm not sure I understand properly what you mean. Do you have an example? – ereOn May 03 '13 at 08:15
  • See my answer to this question: [sum of small double numbers c++](http://stackoverflow.com/q/10330002/1084416). It uses Kahan Summation as @James suggested. – Peter Wood May 03 '13 at 08:31
  • @PeterWood: Thanks, exactly what I wanted. – ereOn May 03 '13 at 08:39
  • I'm not really sure why this was downvoted, but could you explain so that I have a chance for improvement? (I never downvote, check my stats) – ereOn May 03 '13 at 14:29
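For what it's worth, one possible reading of Nawaz's suggestion looks like this (the block size n is a free parameter here, not something prescribed in the comment):

#include <cstddef>
#include <vector>

// Divide the partial sum once per block of n elements instead of
// dividing every element; n is a tuning parameter.
float average_blocked(const std::vector<float>& seq, std::size_t n)
{
  float avg = 0.0f;
  float block_sum = 0.0f;
  std::size_t in_block = 0;

  for (float value : seq)
  {
    block_sum += value;

    if (++in_block == n)
    {
      avg += block_sum / seq.size(); // one division per full block
      block_sum = 0.0f;
      in_block = 0;
    }
  }

  avg += block_sum / seq.size(); // leftover partial block, if any

  return avg;
}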

2 Answers


I wouldn't count on the second being more accurate. The differences in the size of the elements are divided by the length of the vector, but each division introduces some additional imprecision.

If accuracy is a problem, the first step should be to use double. Even if the vector is float (for memory reasons, say), the calculations within the function should be done in double.

Beyond that, for large numbers of elements, you should probably use the Kahan algorithm, rather than just naïvely adding the elements. Although it adds a number of operations in the loop, it keeps track of the error, and will result in significantly more accuracy.
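Kahan summation keeps a running compensation term for the low order bits which are lost at each addition. A sketch, using the same signature as in the question (not necessarily the exact code I used):

#include <vector>

float average( std::vector<float> const& seq )
{
    float sum = 0.0f;
    float compensation = 0.0f;  //  running estimate of the lost low order bits

    for ( float value : seq ) {
        float corrected = value - compensation;
        float tentative = sum + corrected;            //  low order bits of corrected may be lost here...
        compensation = (tentative - sum) - corrected; //  ...and are recovered here
        sum = tentative;
    }

    return sum / seq.size();
}

Note that aggressive floating point optimizations (e.g. -ffast-math) allow the compiler to reassociate these operations and can optimize the compensation away entirely.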

EDIT:

Just for the fun of it, I wrote a small program which used the following code to generate the vector:

std::vector<float> v;
v.push_back( 10000000.0f );
for ( int count = 10000000; count > 0; -- count ) {
    v.push_back( 0.1f );
}

The results of the average should be 1.0999999 (practically speaking, 1.1). Using either of the algorithms in the original posting, the results are 0.999999881: an error of 10%. Just changing sum to have type double in the first algorithm, however, results in 1.0999999, about as accurate as you can get. Using the Kahan algorithm (with float everywhere) gives the same results.
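The change to the first algorithm is nothing more than the type of the accumulator; roughly:

#include <vector>

float average( std::vector<float> const& seq )
{
    double sum = 0.0;       //  accumulate in double, even though the elements are float

    for ( float value : seq ) {
        sum += value;
    }

    return static_cast<float>( sum / seq.size() );
}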

James Kanze
  • Thank you very much. I took the liberty of adding a link to @PeterWood's answer, which gives an implementation of the algorithm. – ereOn May 03 '13 at 08:41
  • I especially like Peter Wood's implementation: defining a KahanAccumulator and using `std::accumulate`. Simple and elegant. – James Kanze May 03 '13 at 09:00

If the sum is not too big for type float, the first one might be more accurate, because the individual round-off errors produced by each division may accumulate.

fatihk