
I am trying to calculate the mean and 95% confidence interval of a column "Force" in a large dataset, with the results grouped by the different values of "Class" using the groupby function.

When I calculate the mean and put it in a new dataframe, I get NaN values for all rows. I'm not sure whether I'm going about this the right way. Is there an easier way to do this?

This is the sample dataframe:

import pandas as pd

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]},
                  columns=['Class', 'Force'])

To calculate the confidence interval, the first step was to calculate the mean. This is what I used:

F1_Mean = df.groupby(['Class'])['Force'].mean()

This gave me NaN values for all rows.

– MasterShifu

5 Answers


Update on 25-Oct-2021: @A.Donda pointed out that the 95% interval should be based on 1.96 × the standard error of the mean (i.e., std/√count), not 1.95.
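In other words, each group's interval is mean ± 1.96 × std/√count. For example, for group A3 below: 150 ± 1.96 × 14.142/√2 = 150 ± 19.6, giving the interval (130.4, 169.6).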

import pandas as pd
import math

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]},
                  columns=['Class', 'Force'])
print(df)
print('-'*30)

stats = df.groupby(['Class'])['Force'].agg(['mean', 'count', 'std'])
print(stats)
print('-'*30)

ci95_hi = []
ci95_lo = []

for i in stats.index:
    m, c, s = stats.loc[i]  # mean, count, std for this group
    ci95_hi.append(m + 1.96*s/math.sqrt(c))  # mean + 1.96 * standard error
    ci95_lo.append(m - 1.96*s/math.sqrt(c))  # mean - 1.96 * standard error

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
print(stats)

The output is (note that group A2 has only one observation, so its sample standard deviation, and hence its interval, is NaN):

  Class  Force
0    A1     50
1    A1    150
2    A1    100
3    A2    120
4    A3    140
5    A3    160
------------------------------
       mean  count        std
Class                        
A1      100      3  50.000000
A2      120      1        NaN
A3      150      2  14.142136
------------------------------
       mean  count        std     ci95_hi     ci95_lo
Class                                                
A1      100      3  50.000000  156.580326   43.419674
A2      120      1        NaN         NaN         NaN
A3      150      2  14.142136  169.600000  130.400000
– yoonghm

  • Such an outstanding answer. I wish I could award it. Returning a df with all the stats in it is such a great practice. Well done. – autonopy Mar 12 '21 at 15:26
  • So here's my addition: to get it to return a formatted string column, add the following: `stats['95p_ci'] = "(" + stats['ci95_lo'].round(1).astype(str) + ', ' + stats['ci95_hi'].round(1).astype(str) + ')'` – autonopy Mar 12 '21 at 15:27
  • The correct multiplier for a 95% CI is 1.96, not 1.95. Also, be aware that this interval relies on a normal approximation to the sampling distribution of the mean, and only works well for large samples. – A. Donda Oct 23 '21 at 21:35
  • @A.Donda, you are correct. Let me update – yoonghm Oct 25 '21 at 04:41
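For reference (my addition, not part of the comments), appending @autonopy's line to the code above produces a formatted column like this; A2 shows (nan, nan) because its standard deviation is undefined for a single observation:

print(stats['95p_ci'])

Class
A1     (43.4, 156.6)
A2        (nan, nan)
A3    (130.4, 169.6)
Name: 95p_ci, dtype: object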

You can simplify @yoonghm's solution by taking advantage of 'sem', the built-in aggregation for the standard error of the mean.

import pandas as pd

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]},
                  columns=['Class', 'Force'])
print(df)
print('-'*30)

stats = df.groupby(['Class'])['Force'].agg(['mean', 'sem'])
print(stats)
print('-'*30)


stats['ci95_hi'] = stats['mean'] + 1.96 * stats['sem']
stats['ci95_lo'] = stats['mean'] - 1.96 * stats['sem']
print(stats)
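For the sample data this should print the same intervals as the first answer (exact float formatting may differ across pandas versions):

        mean        sem     ci95_hi     ci95_lo
Class
A1     100.0  28.867513  156.580326   43.419674
A2     120.0        NaN         NaN         NaN
A3     150.0  10.000000  169.600000  130.400000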

Don't mean to be a pain, but the 1.96 × standard error formula is a gross oversimplification and leads to bad conclusions in smaller samples. Use the t distribution instead:

import pandas as pd
import scipy.stats as stats

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]},
                  columns=['Class', 'Force'])
print(df)
grouped = df.groupby(['Class'])['Force'].agg(['mean', 'count', 'std'])

# Calculate the t-value for a 95% confidence interval
t_value = stats.t.ppf(0.975, grouped['count'] - 1)  # 0.975 corresponds to (1 - alpha/2)
# Calculate the margin of error
me = t_value * grouped['std'] / (grouped['count'] ** 0.5)
# Calculate the lower and upper bounds of the confidence interval   
grouped['ci_low'] = grouped['mean'] - me 
grouped['ci_high'] = grouped['mean'] + me 
print(grouped)

Out =

  Class  Force
0    A1     50
1    A1    150
2    A1    100
3    A2    120
4    A3    140
5    A3    160
        mean  count        std     ci_low     ci_high
Class                                                
A1     100.0      3  50.000000 -24.206886  224.206886
A2     120.0      1        NaN        NaN         NaN
A3     150.0      2  14.142136  22.937953  277.062047

(assistance from chatgpt 3.5 acknowledged)
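As a side note (my addition, not part of this answer), scipy can compute the same t-based interval directly; a minimal sketch using scipy.stats.t.interval and pandas' Series.sem:

import pandas as pd
import scipy.stats as st

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]})

def t_ci(x, confidence=0.95):
    # (low, high) for the group mean: len(x) - 1 degrees of freedom,
    # scale = standard error of the mean (Series.sem, ddof=1 by default)
    return st.t.interval(confidence, len(x) - 1, loc=x.mean(), scale=x.sem())

print(df.groupby('Class')['Force'].apply(t_ci))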

– ErichBSchulz

As mentioned in the comments, I could not reproduce your error, but you can check that your numbers are stored as numbers and not as strings. Use df.info() and make sure the relevant columns are float or int:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Class    6 non-null object   # <--- non-number column
Force    6 non-null int64    # <--- number (int) column
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
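If Force does turn out to be stored as strings (object dtype), here is a sketch of one way to fix it, using pd.to_numeric with errors='coerce' so unparseable values become NaN instead of raising:

import pandas as pd

# hypothetical case where Force was read in as strings
df = pd.DataFrame({'Class': ['A1', 'A1', 'A2'], 'Force': ['50', '150', '120']})
df['Force'] = pd.to_numeric(df['Force'], errors='coerce')  # object -> numeric dtype
print(df.groupby('Class')['Force'].mean())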
– Dror Paz

The pd.Series.quantile method can be used to get the 2.5% and 97.5% quantiles of each group like this:

confidence_intervals = df.groupby('Class').quantile(q=[0.025, 0.975])
print(confidence_intervals)

Output:

             Force
Class             
A1    0.025   52.5
      0.975  147.5
A2    0.025  120.0
      0.975  120.0
A3    0.025  140.5
      0.975  159.5
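Note that these are percentiles of the raw data in each group, not a confidence interval for the group mean. If you want a percentile-based interval for the mean itself, here is a minimal bootstrap sketch (the 10,000 resamples and fixed seed are arbitrary choices):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(x, n_boot=10_000):
    # resample the group with replacement, take the mean of each resample,
    # then read off the 2.5th and 97.5th percentiles of those means
    vals = x.to_numpy()
    means = rng.choice(vals, size=(n_boot, len(vals)), replace=True).mean(axis=1)
    return tuple(np.quantile(means, [0.025, 0.975]))

df = pd.DataFrame({'Class': ['A1','A1','A1','A2','A3','A3'],
                   'Force': [50,150,100,120,140,160]})
print(df.groupby('Class')['Force'].apply(bootstrap_ci))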
– Bill