I'm trying to find the 95% CI for a linear regression model of a set of data (the data are in the code below). I started with scipy.stats.linregress, but since scipy doesn't provide a convenient method to compute the 95% CI of the regression line, I moved on and came across this question, where I read that seaborn uses statsmodels under the hood. I tried both methods and was surprised to see different results. I include the code here for reproducibility.
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels import api as sm
import seaborn as sns
from scipy import stats
df = pd.DataFrame(
    [
        [371.0, 577.0],
        [621.0, 493.0],
        [697.0, 356.0],
        [762.0, 386.0],
        [820.0, 283.0],
        [832.0, 354.0],
        [849.0, 540.0],
        [864.0, 370.0],
        [916.0, 284.0],
        [949.0, 95.0],
        [970.0, 145.0],
        [1051.0, 95.0],
        [1101.0, 45.0],
        [1190.0, 45.0],
    ],
    index=[2014, 2015, 2016, 2017, 2011, 2018, 2013, 2012, 2019, 2009, 2010, 2008, 2007, 2006],
    columns=['x', 'y'],
)
print(df)
# Fit with scipy and compute 95% CIs for slope and intercept via the t distribution
lr = stats.linregress(df['x'], df['y'])
# lr.slope, lr.intercept, lr.rvalue, lr.pvalue, lr.stderr, lr.intercept_stderr
tinv = lambda p, df: abs(stats.t.ppf(p / 2, df))  # two-sided inverse Student t
ts = tinv(0.05, len(df.index) - 2)  # critical value for n - 2 degrees of freedom
print(f'pvalue: {lr.pvalue:.6f}.')
print(f'R-squared: {lr.rvalue**2:.6f}.')
print(
    f'slope: {lr.slope:.3f}, 95% CI [{lr.slope - ts * lr.stderr:.3f}, {lr.slope + ts * lr.stderr:.3f}].'
)
print(
    f'intercept: {lr.intercept:.3f}, 95% CI [{lr.intercept - ts * lr.intercept_stderr:.3f}, {lr.intercept + ts * lr.intercept_stderr:.3f}].'
)
# Plot 1: scipy fit
plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'], df['x'] * lr.slope + lr.intercept, c='blue', label='lr model')
plt.legend()
plt.title('with scipy')
plt.show()
# Plot 2: seaborn regression plot with 95% CI band
sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.title('with sns')
plt.show()
# Plot 3: statsmodels OLS fit with confidence and prediction intervals
X = sm.add_constant(df['x'])
ols_model = sm.OLS(df['y'], X)
est = ols_model.fit()
print(est.summary())
predictions = est.get_prediction(X).summary_frame()
plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.3,
    label='lr model CI 95%',
)
plt.fill_between(
    df['x'][predictions.index],
    predictions['obs_ci_lower'],
    predictions['obs_ci_upper'],
    color='green',
    alpha=0.3,
    label='prediction_interval CI 95%',
)
plt.legend()
plt.title('with statsmodels')
plt.show()
# Plot 4: seaborn band overlaid with the statsmodels CI band
sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.1,
    label='lr model CI 95%',
)
plt.title('sns + statsmodels')
plt.legend()
plt.show()
This code generates four plots: the first with linregress, the second with sns, the third with statsmodels, and the last with sns + statsmodels, which I also post here.
The sns output is the green one, while the blue one is the model obtained with statsmodels.
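To see what the analytic band should look like, I also tried recomputing the 95% CI of the regression mean by hand from the linregress fit. This is only a rough sanity-check sketch, assuming the usual textbook formula for the standard error of the fitted mean and reusing df, lr and ts defined above:
import numpy as np
x = df['x'].to_numpy()
y = df['y'].to_numpy()
n = len(x)
y_hat = lr.intercept + lr.slope * x
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # residual standard error
sxx = np.sum((x - x.mean()) ** 2)
se_mean = s * np.sqrt(1 / n + (x - x.mean()) ** 2 / sxx)  # SE of the fitted mean at each x
manual_ci_lower = y_hat - ts * se_mean
manual_ci_upper = y_hat + ts * se_mean
# these should match predictions['mean_ci_lower'] / predictions['mean_ci_upper'] from statsmodels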
Why are the plots obtained with sns and statsmodels different?