Plot 95% confidence interval errorbar python pandas dataframes

Question

I want to show 95% confidence interval with Python pandas, matpolib... But I stucked, because for usual .std() I would do smth like this:

import pandas as pd
import numpy as np

import matplotlib

matplotlib.use('Agg')

import matplotlib.pyplot as plt
import math

data = pd.read_table('output.txt',sep=r'\,', engine='python')
Ox = data.groupby(['Ox'])['Ox'].mean()
Oy = data.groupby(['Ox'])['Oy'].mean()
std = data.groupby(['Ox'])['Oy'].std()

plt.plot(Ox, Oy , label = 'STA = '+ str(x))
plt.errorbar(Ox, Oy, std, label = 'errorbar', linewidth=2)

plt.legend(loc='best', prop={'size':9.2})

plt.savefig('plot.pdf')
plt.close()

But I haven't found something in pandas methods which can help me. Does anybody know?

You could usw either 2*std, as two simga is approx 95% vor use the pandas quantile methid to calculate the 0.025 and 0.975 quantile. — MaxNoe, Jun 17 '17 at 10:49

MaxNoe · Accepted Answer · 2018-12-28T18:34:42.363

Using 2 * std to estimate the 95 % interval

In a normal distribution, the interval [μ - 2σ, μ + 2σ] covers 95.5 %, so you can use 2 * std to estimate the 95 % interval:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame()
df['category'] = np.random.choice(np.arange(10), 1000, replace=True)
df['number'] = np.random.normal(df['category'], 1)

mean = df.groupby('category')['number'].mean()
std = df.groupby('category')['number'].std()

plt.errorbar(mean.index, mean, xerr=0.5, yerr=2*std, linestyle='')
plt.show()

Result:

Using percentiles

If your distribution is skewed, it is better to use asymmetrical errorbars and get your 95% interval from the percentiles.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

df = pd.DataFrame()
df['category'] = np.random.choice(np.arange(10), 1000, replace=True)
df['number'] = skewnorm.rvs(5, df['category'], 1)

mean = df.groupby('category')['number'].mean()
p025 = df.groupby('category')['number'].quantile(0.025)
p975 = df.groupby('category')['number'].quantile(0.975)

plt.errorbar(
    mean.index,
    mean,
    xerr=0.5,
    yerr=[mean - p025, p975 - mean],
    linestyle='',
)
plt.show()

Result:

Note that yerr has to be index from 0 for some reason in 0.19.2. — mirgee, Dec 01 '17 at 09:50
Isn't the 95% confidence interval of a normal distribution the standard error of the mean (SEM) at 95% confidence? That is ~ 2 sigma/root(N) where n is the sample count. I never plot the 2sigma as the 95% CI. Obviously, this is not what this post is about but I feel is strange to recommend plotting 2 sigma as error bars. — n3rd, Sep 03 '20 at 02:09
I think you are confusing several things here. The 95 % confidence interval for any value of a normal distribution is just that, the interval in which 95 % of the values land. What you seem to be conflating with is the 95 % confidence interval for a mean calculated from a sample. — MaxNoe, Sep 04 '20 at 12:48

score 8 · Answer 2 · answered Jun 17 '17 at 11:09

For a normal distribution ~95% of the values lie within a window of 4 standard deviations around the mean, or in other words, 95% of the values are within plus/minus 2 standard deviations from the mean. See, e.g. 68–95–99.7-rule.

plt.errorbar's yerr argument specifies the length of the single sided errorbar. Thus taking

plt.errorbar(x,y,yerr=2*std)

where std is the standard deviation shows the errorbars of the 95% confidence interval.

The link you posted helps a lot. – GoingMyWay Feb 21 '19 at 02:07 — GoingMyWay, Feb 21 '19 at 02:07

score 1 · Answer 3 · answered Apr 08 '21 at 18:24

To obtain the 95% confidence interval you'll need to define a function.

Tweaking: Compute a confidence interval from sample data

def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m

def bound_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return h

Then define:

mean = df.groupby(by='yourquery').agg(mean_confidence_interval)
bound = df.groupby(by='yourquery').agg(bound_confidence_interval)

Finally plot with a library of your choice: e.g. plotly

import plotly.graph_objects as go

   fig = go.Figure(data=go.Scatter(
        x=mean[yourquery],
        y=mean[yourquery2],
        error_y=dict(
            type='data', # value of error bar given in data coordinates
            array=bound[yourquery2],
            visible=True)
    ))
    fig.show()

Plot 95% confidence interval errorbar python pandas dataframes

3 Answers3

Using 2 * std to estimate the 95 % interval

Result:

Using percentiles

Result: