My goal is to compute the mean and standard deviation of different samples. For example, I have a df:
Start End n Mean Dev
abc x 54 47 5
abc x 45 42 4
abc y 14 50 10
abc y 16 30 20
ijk x 25 20 5
ijk x 25 20 5
ijk z 7 10 2
Output:
Start End N CombinedMean CombinedDev
abc x 99 44.7 5.2
abc y 30 39.3 19
ijk x 50 20 5
ijk z 7 10 2
Normally I would use groupby and agg({'N': 'sum', 'Mean':'mean', 'Dev':'mean'}) but this is mathematically incorrect. To compute the combined mean and deviation in this case I have to use:
CombinedMean = sum(n*mean)/Sum(n)
CombinedDev = sqrt[ sum(n*(dev^2+mean^2))/sum(n) - (combined mean)^2]
For example,
N = n1+n2 = 54+45 = 99
Sum = n1*mean1 + n2*mean2 = 54*47+45*42 = 4,428
The combined mean is Sum / N = 4,428 / 99 = 44.7
Sum of Squares = n1*(sd1^2 + mean1^2) + n2*(sd2^2 + mean2^2) = 54*(5^2 + 47^2) + 45*(4^2 + 42^2) = 200,736
So the combined standard deviation is SQRT(Sum of Squares / N - Mean^2) = SQRT(200,739/99 - 44.7^2) = 5.2
Long question short: how can I implement the 2 formulas in my code? Many thanks! My apology for the lengthy question :)