
This is an example dataframe; my actual dataframe has hundreds more rows.

nums_1  nums_2  nums_3
1       1       8
2       1       7
3       5       9

Is there a method that will calculate the 95% confidence interval across each row, ideally one that also works for a large dataframe?

import pandas as pd

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})
I'mahdi
Niam45

1 Answer

You can use:

import numpy as np
from scipy import stats

# 95% t-interval for the mean of each row, with len(x)-1 degrees of freedom
df.apply(lambda x: stats.t.interval(0.95, len(x)-1, loc=np.mean(x), scale=stats.sem(x)), axis=1)

You will obtain essentially the same results by using the following:

import statsmodels.stats.api as sms

df.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)

Both approaches return the same result: one (lower, upper) tuple per row.

The approach is described here: Compute a confidence interval from sample data. What is important to understand is that it works correctly only if the values in each row (each sample) are drawn independently from a normal distribution with an unknown standard deviation.

When it comes to large dataframes, the easy solution is to use swifter. However, it only makes the calculation about twice as fast. Nevertheless, it is worth trying: https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66

import statsmodels.stats.api as sms
import swifter

df.swifter.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
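
If even swifter is too slow, one alternative (not part of the approaches above, just a sketch) is to skip apply entirely and compute the same t-interval with NumPy over the whole array at once. This assumes every row has the same number of values and contains no missing data; the column names ci_lower and ci_upper are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

values = df.to_numpy(dtype=float)   # shape: (rows, values per row)
n = values.shape[1]                 # sample size of each row

# Row-wise mean and standard error of the mean (ddof=1, like stats.sem)
mean = values.mean(axis=1)
sem = values.std(axis=1, ddof=1) / np.sqrt(n)

# Two-sided 95% critical value of the t distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Assumed output layout: one column per bound (names are hypothetical)
ci = pd.DataFrame(
    {'ci_lower': mean - t_crit * sem, 'ci_upper': mean + t_crit * sem},
    index=df.index,
)
print(ci)
```

Because the critical value is the same for every row, it is computed once and everything else is plain array arithmetic, so no Python function runs per row. If rows can contain NaN, the per-row sample size differs and this sketch would need adjusting.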

Edit: if you want to round your results and maybe get two columns instead of one with tuples, you can use:

def get_conf_interv(x):
    res1, res2 = sms.DescrStatsW(x).tconfint_mean()
    return round(res1, 2), round(res2, 2)

df[['res1', 'res2']] = df.swifter.apply(get_conf_interv, axis=1, result_type='expand')
jack