
I'm trying to find the 95% CI for a linear regression model of a set of data (the data are provided in the code below). I started with scipy.stats.linregress, but since SciPy doesn't provide a convenient method to compute the 95% CI, I moved on and came across this question, where I read that seaborn uses statsmodels under the hood. I tried both methods and was surprised to see different results. I include the code here for reproducibility.

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels import api as sm
import seaborn as sns
from scipy import stats


df = pd.DataFrame(
    [
        [371.0, 577.0],
        [621.0, 493.0],
        [697.0, 356.0],
        [762.0, 386.0],
        [820.0, 283.0],
        [832.0, 354.0],
        [849.0, 540.0],
        [864.0, 370.0],
        [916.0, 284.0],
        [949.0, 95.0],
        [970.0, 145.0],
        [1051.0, 95.0],
        [1101.0, 45.0],
        [1190.0, 45.0],
    ],
    index=[
        2014,
        2015,
        2016,
        2017,
        2011,
        2018,
        2013,
        2012,
        2019,
        2009,
        2010,
        2008,
        2007,
        2006,
    ],
    columns=['x', 'y'],
)

print(df)

lr = stats.linregress(df['x'], df['y'])
# lr.slope, lr.intercept, lr.rvalue, lr.pvalue, lr.stderr, lr.intercept_stderr
# two-sided critical value of the t distribution with n - 2 degrees of freedom
tinv = lambda p, dof: abs(stats.t.ppf(p / 2, dof))
ts = tinv(0.05, len(df.index) - 2)

print(f'pvalue: {lr.pvalue:.6f}.')
print(f'R-squared: {lr.rvalue**2:.6f}.')
print(
    f'slope: {lr.slope:.3f}, 95%CI[{lr.slope - ts * lr.stderr:.3f} - {lr.slope + ts * lr.stderr:.3f}].'
)
print(
    f'intercept: {lr.intercept:.3f}, 95%CI[{lr.intercept - ts * lr.intercept_stderr:.3f} - {lr.intercept + ts * lr.intercept_stderr:.3f}].'
)


plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'], df['x'] * lr.slope + lr.intercept, c='blue', label='lr model')
plt.legend()
plt.title('with scipy')
plt.show()

sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.title('with sns')
plt.show()


X = sm.add_constant(df['x'])
ols_model = sm.OLS(df['y'], X)
est = ols_model.fit()
print(est.summary())
predictions = est.get_prediction(X).summary_frame()

plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.3,
    label='lr model CI 95%',
)
plt.fill_between(
    df['x'][predictions.index],
    predictions['obs_ci_lower'],
    predictions['obs_ci_upper'],
    color='green',
    alpha=0.3,
    label='prediction_interval CI 95%',
)

plt.legend()
plt.title('with statsmodels')
plt.show()

sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.1,
    label='lr model CI 95%',
)

plt.title('sns + statsmodels')
plt.legend()
plt.show()

This code generates 4 plots: the first with linregress, the second with sns, the third with statsmodels, and the last with sns + statsmodels, which I also post here.

[figure fig4_v2: the sns + statsmodels overlay plot]

The sns output is the green one, while the blue one is the model obtained with statsmodels.

Why are the plots obtained with sns and statsmodels different?

Zeno Dalla Valle
  • alpha=0.1 versus ci=0.95 ? – Josef Dec 05 '22 at 17:56
  • also, AFAICS, you use bootstrap intervals for seaborn, while statsmodels only uses asymptotic intervals. – Josef Dec 05 '22 at 17:58
  • So the difference is due to the different calculation of asymptotic intervals? – Zeno Dalla Valle Dec 05 '22 at 18:08
  • The default alpha for OLSResults is 0.05 https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.conf_int.html – Zeno Dalla Valle Dec 05 '22 at 18:10
  • sorry, (I don't like greek letters), I mixed up alpha in fill_between with alpha for confints. – Josef Dec 05 '22 at 20:33
  • 1
    bootstrap does not rely on asymptotic distributional assumptions. So the two methods will produce different results unless the sample size is large (and distribution is approximately normal) – Josef Dec 05 '22 at 20:35
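
The distinction Josef points out can be illustrated with a minimal sketch of a pair-resampling bootstrap of the kind seaborn performs for its regression band. The grid size, seed, and variable names here are assumptions for illustration, not seaborn's exact internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# same 14 points as in the question's DataFrame
x = np.array([371., 621., 697., 762., 820., 832., 849., 864.,
              916., 949., 970., 1051., 1101., 1190.])
y = np.array([577., 493., 356., 386., 283., 354., 540., 370.,
              284., 95., 145., 95., 45., 45.])

grid = np.linspace(x.min(), x.max(), 100)
n_boot = 1000
boot_lines = np.empty((n_boot, grid.size))
for i in range(n_boot):
    # resample (x, y) pairs with replacement and refit the line
    idx = rng.integers(0, len(x), len(x))
    res = stats.linregress(x[idx], y[idx])
    boot_lines[i] = res.intercept + res.slope * grid

# 95% percentile band across the bootstrap fits (what sns shades),
# as opposed to the analytic t-based band from statsmodels' get_prediction
lo, hi = np.percentile(boot_lines, [2.5, 97.5], axis=0)
```

With only 14 observations, this percentile band generally does not coincide with the asymptotic mean_ci_lower/mean_ci_upper band from statsmodels, which is the discrepancy visible in the fourth plot.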

0 Answers