
I was comparing two algorithms computing the average of random numbers.

  • The first algorithm sums all the numbers and divides by the item count at the end
  • The second algorithm updates the average on every iteration and reuses the previous result when new data is received

I suppose there's nothing revolutionary here, and I'm not a mathematician so I can't put a name on those two algorithms.

Here is my code:

#include <iostream>
#include <iomanip>
#include <cstdlib>

class Average1
{
public:
    Average1() : total( 0 ), count( 0 ) {}

    void add( double value )
    {
        total += value;
        count++;
    }

    double average()
    {
        return total/count;
    }

private:
    double total;
    size_t count;
};

class Average2
{
public:
    Average2() : av( 0 ), count( 0 ) {}

    void add( double value )
    {
        av = (av*count + value)/(count+1);
        count++;
    }

    double average()
    {
        return av;
    }

private:
    double av;
    size_t count;
};

void compare()
{
    Average1 av1;
    Average2 av2;
    double temp;
    for ( size_t i = 0; i != 100000000; ++i )
    {
        temp = static_cast<double>(std::rand()) / static_cast<double>(RAND_MAX);
        av1.add( temp );
        av2.add( temp );
    }

    std::cout << std::setprecision(20) << av1.average() << std::endl;
    std::cout << std::setprecision(20) << av2.average() << std::endl;
}

int main()
{
    compare();
    return 0;
}

Output is:

0.50001084285722707801
0.50001084285744978875

The difference is certainly due to double type precision.

In the end, which one is the good method? Which one gives the real mathematical average (or closest to...)?

jpo38

4 Answers


If you really want high precision, use a compensated summation algorithm such as Python's math.fsum:

Edit: the Python documentation for math.fsum also links to this overview of approaches.

sascha

My guess is that the first class gives a more reliable result. In the second case, at each iteration you introduce a rounding error due to the division by count, and eventually all these errors add up to the difference you see in the results. In the first case, you only round once, when you compute the final division.

lupod
  • Not necessarily true. You can lose a lot of precision adding a list of numbers if at some point the cumulative sum is very close to the negative of the next number. – rici May 25 '16 at 20:06
  • Yes, but that holds also in the case of the second class, if av*count happens to be close to -value. In this specific case I would go for the first class because it avoids dividing at each addition of a new number. However, I am sure that there are ways to prevent also the problem that you mention. – lupod May 25 '16 at 20:20
  • Yes, the second one is no better. The classic stable algorithm is `newmean = oldmean + (newvalue - oldmean)/newcount`. Even though it involves a division on every iteration, it is more stable than the naive summation algorithm. (And roughly on par with Kahan summation, I believe.) – rici May 25 '16 at 20:32

John D. Cook gives a great analysis; he recommends:

av = av + (value - av)/count;

His posts start with Comparing three methods of computing standard deviation.

Then Theoretical explanation for numerical results

and last Accurately computing running variance

Jacob

My own thinking is that both classes manipulate a quantity that grows with count (the running total in the first, av*count in the second), which becomes large before the division and explains why your results are approximate. I would do:

class Average3
{
public:
    Average3() : av( 0 ), count( 0 ) {}

    void add( double value )
    {
        count++;
        av +=  (value - av)/count;
    }
    double average()
    {
        return av;
    }

private:
    double av;
    size_t count;
};

But you still lose precision when adding the last numbers, because (value - av)/count becomes small compared to av. I'd be glad to know if my intuition was right, though.

Diane M