Pandas computing average mean and stdev

Question

My goal is to compute the mean and standard deviation of different samples. For example, I have a df:

Start   End   n     Mean   Dev
abc     x     54     47    5
abc     x     45     42    4
abc     y     14     50    10
abc     y     16     30    20
ijk     x     25     20    5
ijk     x     25     20    5
ijk     z     7      10    2

Output:

Start   End   N     CombinedMean   CombinedDev
abc     x     99    44.7           5.2
abc     y     30    39.3           19
ijk     x     50    20             5
ijk     z     7     10             2

Normally I would use groupby and agg({'N': 'sum', 'Mean':'mean', 'Dev':'mean'}) but this is mathematically incorrect. To compute the combined mean and deviation in this case I have to use:

CombinedMean = sum(n*mean)/Sum(n)

CombinedDev = sqrt[ sum(n*(dev^2+mean^2))/sum(n) - (combined mean)^2]

For example,

N = n1+n2 = 54+45 = 99

Sum = n1*mean1 + n2*mean2 = 54*47+45*42 = 4,428

The combined mean is Sum / N = 4,428 / 99 = 44.7

Sum of Squares = n1*(sd1^2 + mean1^2) + n2*(sd2^2 + mean2^2) = 54*(5^2 + 47^2) + 45*(4^2 + 42^2) = 200,736

So the combined standard deviation is SQRT(Sum of Squares / N - Mean^2) = SQRT(200,739/99 - 44.7^2) = 5.2

Long question short: how can I implement the 2 formulas in my code? Many thanks! My apology for the lengthy question :)

You need groupby here , def your own function which include those calculation you need , then with df.groupby ().agg() — BENY, Mar 27 '18 at 21:23
@TylerNG, have a go at using `lambda`. There are examples in the marked duplicate. — jpp, Mar 27 '18 at 23:02

Pandas computing average mean and stdev

0 Answers0