19

Suppose we have N numbers(integers, floats, whatever you want) and want to find their arithmetic mean. Simplest method is to sum all values and divide by number of values:

def simple_mean(array[N]): # pseudocode
    sum = 0
    for i = 1 to N
       sum += array[i]
    return sum / N

It works fine, but requires big integers. If we don't want big integers and we are fine with rounding errors, and N is the power of two, we can use 'divide-and-conquer' : ((a+b)/2 + (c+d)/2)/2 = (a+b+c+d)/4, ((a+b+c+d)/4 + (e+f+g+h)/4)/2 = (a+b+c+d+e+f+g+h)/8, so on.

def bisection_average(array[N]):
   if N == 1: return array[1]
   return (bisection_average(array[:N/2])+bisection_average(array[N/2:]))/2

Any other ways?

PS. playground for lazy

Daniel A. White
  • 187,200
  • 47
  • 362
  • 445
maykeye
  • 1,286
  • 1
  • 12
  • 16
  • Interesting, but that bit about 'fine with rounding errors' has got me worried. I'd prefer a method with NO errors. – pavium Aug 28 '09 at 13:01
  • On second thoughts, I'll come back to this in the morning and undelete my answer if I'm still happy that it's not vastly wrong... – Matthew Scharley Aug 28 '09 at 13:08
  • @pavium: if you want a method with NO errors, you need to calculate this by hand. – MusiGenesis Aug 28 '09 at 13:12
  • 1
    @MusiGenesis -- heh, if I calculated this by hand myself it would guarantee K errors where K >> 1 and K/N probably approaches some constant like 1/e or something. ;) – Jason S Aug 28 '09 at 13:38

7 Answers7

30

Knuth lists the following method for calculating mean and standard deviation given floating point (original on p. 232 of Vol 2 of The Art of Computer Programming, 1998 edition; my adaptation below avoids special-casing the first iteration):

double M=0, S=0;

for (int i = 0; i < N; ++i)
{
    double Mprev = M;
    M += (x[i] - M)/(i+1);
    S += (x[i] - M)*(x[i] - Mprev);
}

// mean = M
// std dev = sqrt(S/N) or sqrt(S/N+1)
// depending on whether you want population or sample std dev
Jason S
  • 184,598
  • 164
  • 608
  • 970
17

Here's a way to calculate the mean using just integers without rounding errors and avoiding big intermediate values:

sum = 0
rest = 0
for num in numbers:
  sum += num / N
  rest += num % N
  sum += rest / N
  rest = rest % N

return sum, rest
sepp2k
  • 363,768
  • 54
  • 674
  • 675
  • This basically uses multiprecision (dual word) arithmetic. I think there's a way to optimize this to get the number of divide-like (/ or %) operations down, but I can't remember off the top of my head. – Jason S Aug 28 '09 at 13:37
  • The usual technique is to calculate X/N and X%N in a single function/single operation. This is because the underlying algorithms are pretty much the same. – MSalters Aug 28 '09 at 14:14
  • yes although C doesn't expose them. >:( No, I meant more like: sum += (num + rest) / N; rest = (num + rest) % N; except that may be prone to overflow – Jason S Aug 28 '09 at 17:04
  • 2
    C doesn't expose div/mod with a single function or operator, but most good compilers will generate the instruction/call if you use both. – Stephen Canon Aug 28 '09 at 17:34
  • Just a minor improvement: Since `rest` can never get bigger than `2*(N-1)`, you could replace the second division/modulo operations by a simple `if(rest > N)` branch (it is of course platform dependent if branches or divisions are faster, or if the compiler is smart enough to see this itself). Actually, to also avoid overflows (if `N>MAXINT/2`) you can subtract `N` from `rest` everytime `rest` gets positive (and after the loop decide if you want to round `sum` up, depending on `rest`) – chtz Sep 26 '16 at 15:53
  • 2
    @Stephen: _"C doesn't expose div/mod with a single function or operator"_ - unless we count [`div`](http://en.cppreference.com/w/c/numeric/math/div), which does exactly that - although not for floats – Eric Nov 25 '16 at 13:07
3

If the big integers are problem... is it ok

a/N + b/N+.... n/N

I mean you're looking just for other ways or the optimal way?

Svetlozar Angelov
  • 21,214
  • 6
  • 62
  • 67
  • 2
    why?!?! If a,b,etc are integers this will give you an incorrect answer. If they are floating point, I'm not sure but my hunch is you will get more rounding errors than if you just performed a sum and then divided. In either case the computation time is increased greatly for a questionable benefit. – Jason S Aug 28 '09 at 13:35
3

If the array is floating-point data, even the "simple" algorithm suffers from rounding error. Interestingly, in that case, blocking the computation into sqrt(N) sums of length sqrt(N) actually reduces the error in the average case (even though the same number of floating-point roundings are performed).

For integer data, note that you don't need general "big integers"; if you have fewer than 4 billion elements in your array (likely), you only need an integer type 32 bits larger than that the type of the array data. Performing addition on this slightly larger type will pretty much always be faster than doing division or modulus on the type itself. For example, on most 32 bit systems, 64-bit addition is faster than 32-bit division/modulus. This effect will only become more exaggerated as the types become larger.

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
1

If you use float you might avoid big integers:

def simple_mean(array[N]):
    sum = 0.0 # <---
    for i = 1 to N
       sum += array[i]
    return sum / N
Nick Dandoulakis
  • 42,588
  • 16
  • 104
  • 136
0

The Kahan algorithm (according to the wikipedia) has better runtime performance (than the pairwise summation) -O(n)- and an O(1) error growth:

function KahanSum(input)
    var sum = 0.0
    var c = 0.0                  // A running compensation for lost low-order bits.
    for i = 1 to input.length do
        var y = input[i] - c     // So far, so good: c is zero.
        var t = sum + y          // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t           // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
        // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum

Its idea is that the low bits of the floating point numbers are summed and corrected independently from the main summation.

Gábor Bakos
  • 8,982
  • 52
  • 35
  • 52
0

Building on Jason S's solution to find a weighted-average and reign back in S's growth.

Using the M finding algorithm given before along with the aggregate formulas for weighted average and population standard deviation:

  • Avg = Avg(W*X) / Avg(W)
  • StDev = sqrt(Avg(W*X*X) / Avg(W) - Avg*Avg)

rewrite the code to find the three running averages, then do then the aggregate calculations at the end

function GetPopulationStats{
<#
.SYNOPSIS
Calculate the average, variance, and standard deviation of a weighted data set

.DESCRIPTION
Uses the Knuth method for finding means adapted by Jason S, to calculate the 
three averages required by the agregate statistical formulas

.LINK
https://stackoverflow.com/a/1346890/4496560
#>
   param(
      [decimal[]]$x #Data Points
     ,[decimal[]]$w #Weights
   )
   
   $N = $x.Length
   
   [decimal]$AvgW   = 0
   [decimal]$AvgWX  = 0
   [decimal]$AvgWXX = 0

   for($i=0; $i -lt $N; $i++){
      $AvgW   += ($w[$i]               - $AvgW)   / ($i+1)
      $AvgWX  += ($w[$i]*$x[$i]        - $AvgWX)  / ($i+1)
      $AvgWXX += ($w[$i]*$x[$i]*$x[$i] - $AvgWXX) / ($i+1)
   }

   [ordered]@{
      N     = $N
      Avg   = ($avg = $AvgWX / $AvgW)
      Var   = ($var = ($AvgWXX / $AvgW) - ($Avg * $Avg))
      StDev = SquareRoot $var
   }
}

and then if your language is like Windows PowerShell you'll likely need a higher precision [math]::sqrt() function

function SquareRoot([decimal]$a){
<#
.SYNOPSIS
Find the square-root of $a

.DESCRIPTION
Uses the McDougall-Wotherspoon variant of the Newton-Raphson method to 
find the positive zero of:
   f(x)  = (x * x) - a
   f'(x) = 2 * x

.NOTES
ToDo: using a fitted polynomial would likely run faster
#>
   $BiCycleX = $PrevX = 0; 
   $x  = $a/2 # guess
   $y  = ($x * $x) - $a
   $xx = $x
   $m  = $x + $xx

   $del = $x - $PrevX
   if($del -lt 0){ $del = -$del }
   
   $i = 0
   while($del -gt 0 -and $x -ne $BiCycleX){
      $BiCycleX = $PrevX;
      $PrevX    = $x;
      $x  = $x - ($y / $m)
      $y  = ($x * $x) - $a
      $xx = $x - ($y / $m)
      $m  = $x + $xx

      $del = $x - $PrevX
      if($del -lt 0){ $del = -$del }
      
      if(++$i -ge 50){
         throw ("invariant sanity fail on step {0}:`r`n   x_(n-1) = {1}`r`n   x_n     = {2}`r`n   delta   = {3:0.#e0}" -f $i,$PrevX,$x,$del)
      }
   }
   
   ($x + $PrevX) / 2
}

however, if you don't need a weighted solution it should be easy enough to just let w[i]=1 for all i

finally, it doesn't hurt to do a quick sanity check on the code

describe 'tool spot-check' {
   context 'testing root calcs' {
      $TestCases = @(
         @{Value = 0; Expected = 0}
         @{Value = 1; Expected = 1}
         @{Value = 4; Expected = 2}
         @{Value = 9; Expected = 3}
         @{Value = 2; Expected = [decimal]'1.4142135623730950488016887242'}
         @{Value = (1e14-1); Expected = [decimal]'9999999.99999995'}
      )
      It 'finds the square root of: <Value>' -TestCases $TestCases {
         param($Value,$Expected)
         SquareRoot $Value | should be $Expected
      }
   }
   
   context 'testing stat calcs' {
      It 'calculates the values for 1 to 1000' {
         $x = 1..1000
         $w = @(1) * 1000
         
         $results = GetPopulationStats $x $w
         $results.N     | should be 1000
         $results.Avg   | should be 500.5
         $results.Var   | should be 83333.25
         $results.StDev | should be ([decimal]'288.67499025720950043826670416')
      }

      It 'calculates the values for a test data set' {
         $x = @(33,119,37,90,50,94,32,147,86,28,50,80,145,131,121,90,140,170,214,70,124)
         $w = @(207,139,25,144,72,162,93,91,109,151,125,87,49,99,210,105,99,169,50,59,22)
         
         $results = GetPopulationStats $x $w
         $results.N     | should be 21
         $results.Avg   | should be ([decimal]'94.54433171592412880458756066')
         $results.Var   | should be ([decimal]'2202.659150711314347179152603')
         $results.StDev | should be ([decimal]'46.93249567955356821948311637')
      }
   }
}
Gregor y
  • 1,762
  • 15
  • 22