
I'm trying to find the 95% CI for a linear regression model of a set of data (the data are provided in the code below). I started with scipy.stats.linregress, but since SciPy doesn't provide a convenient method to compute the 95% CI, I moved on and came across this question, where I read that seaborn uses statsmodels under the hood. I tried both methods and was surprised to see different results. I include the code here for reproducibility.

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels import api as sm
import seaborn as sns
from scipy import stats


df = pd.DataFrame(
    [
        [371.0, 577.0],
        [621.0, 493.0],
        [697.0, 356.0],
        [762.0, 386.0],
        [820.0, 283.0],
        [832.0, 354.0],
        [849.0, 540.0],
        [864.0, 370.0],
        [916.0, 284.0],
        [949.0, 95.0],
        [970.0, 145.0],
        [1051.0, 95.0],
        [1101.0, 45.0],
        [1190.0, 45.0],
    ],
    index=[
        2014,
        2015,
        2016,
        2017,
        2011,
        2018,
        2013,
        2012,
        2019,
        2009,
        2010,
        2008,
        2007,
        2006,
    ],
    columns=['x', 'y'],
)

print(df)

lr = stats.linregress(df['x'], df['y'])
# lr.slope, lr.intercept, lr.rvalue, lr.pvalue, lr.stderr, lr.intercept_stderr
# two-sided critical value of the t distribution with n - 2 degrees of freedom
tinv = lambda p, dof: abs(stats.t.ppf(p / 2, dof))
ts = tinv(0.05, len(df.index) - 2)

print(f'pvalue: {lr.pvalue:.6f}.')
print(f'R-squared: {lr.rvalue**2:.6f}.')
print(
    f'slope: {lr.slope:.3f}, 95%CI[{lr.slope - ts * lr.stderr:.3f} - {lr.slope + ts * lr.stderr:.3f}].'
)
print(
    f'intercept: {lr.intercept:.3f}, 95%CI[{lr.intercept - ts * lr.intercept_stderr:.3f} - {lr.intercept + ts * lr.intercept_stderr:.3f}].'
)


plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'], df['x'] * lr.slope + lr.intercept, c='blue', label='lr model')
plt.legend()
plt.title('with scipy')
plt.show()

sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.title('with sns')
plt.show()


X = sm.add_constant(df['x'])
ols_model = sm.OLS(df['y'], X)
est = ols_model.fit()
print(est.summary())
predictions = est.get_prediction(X).summary_frame()

plt.scatter(df['x'], df['y'], label='data')
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.3,
    label='lr model CI 95%',
)
plt.fill_between(
    df['x'][predictions.index],
    predictions['obs_ci_lower'],
    predictions['obs_ci_upper'],
    color='green',
    alpha=0.3,
    label='prediction_interval CI 95%',
)

plt.legend()
plt.title('with statsmodels')
plt.show()

sns.lmplot(x='x', y='y', data=df, fit_reg=True, ci=95, n_boot=1000)
plt.plot(df['x'][predictions.index], predictions['mean'], c='blue', label='lr model')
plt.fill_between(
    df['x'][predictions.index],
    predictions['mean_ci_lower'],
    predictions['mean_ci_upper'],
    color='blue',
    alpha=0.1,
    label='lr model CI 95%',
)

plt.title('sns + statsmodels')
plt.legend()
plt.show()

This code generates 4 plots: the first with linregress, the second with sns, the third with statsmodels, and the last with sns + statsmodels, which I also post here.

[figure fig4_v2: the sns + statsmodels overlay plot]

The sns output is the green one, while the blue one is the model obtained with statsmodels.

Why are the plots obtained with sns and statsmodels different?

Zeno Dalla Valle
  • alpha=0.1 versus ci=0.95 ? – Josef Dec 05 '22 at 17:56
  • also, AFAICS, you use bootstrap intervals for seaborn, while statsmodels only uses asymptotic intervals. – Josef Dec 05 '22 at 17:58
  • So the difference is due to the different calculation of asymptotic intervals? – Zeno Dalla Valle Dec 05 '22 at 18:08
  • The default alpha for OLSResults is 0.05 https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.conf_int.html – Zeno Dalla Valle Dec 05 '22 at 18:10
  • sorry, (I don't like greek letters), I mixed up alpha in fill_between with alpha for confints. – Josef Dec 05 '22 at 20:33
  • 1
    bootstrap does not rely on asymptotic distributional assumptions. So the two methods will produce different results unless the sample size is large (and distribution is approximately normal) – Josef Dec 05 '22 at 20:35
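
The distinction Josef points out can be illustrated with a minimal sketch of a pair-resampling bootstrap of the kind seaborn performs for its regression band. The grid size, seed, and variable names here are assumptions for illustration, not seaborn's exact internals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# same 14 points as in the question's DataFrame
x = np.array([371., 621., 697., 762., 820., 832., 849., 864.,
              916., 949., 970., 1051., 1101., 1190.])
y = np.array([577., 493., 356., 386., 283., 354., 540., 370.,
              284., 95., 145., 95., 45., 45.])

grid = np.linspace(x.min(), x.max(), 100)
n_boot = 1000
boot_lines = np.empty((n_boot, grid.size))
for i in range(n_boot):
    # resample (x, y) pairs with replacement and refit the line
    idx = rng.integers(0, len(x), len(x))
    res = stats.linregress(x[idx], y[idx])
    boot_lines[i] = res.intercept + res.slope * grid

# 95% percentile band across the bootstrap fits (what sns shades),
# as opposed to the analytic t-based band from statsmodels' get_prediction
lo, hi = np.percentile(boot_lines, [2.5, 97.5], axis=0)
```

With only 14 observations, this percentile band generally does not coincide with the asymptotic mean_ci_lower/mean_ci_upper band from statsmodels, which is the discrepancy visible in the fourth plot.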

0 Answers