I am trying to plot confidence intervals in a matplotlib plot with the seaborn style (similar to what seaborn's regplot function produces, but with the ability to run statistics on the regression).

My current plot looks like this: [image]

It is created with the following code:

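import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

#Read in proper dataframe
storms_per_year = pd.read_csv('number_of_storms_per_year.csv')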

#Create linear regression function


def lin_reg(x,y):
    linreg = LinearRegression().fit(x,y)
    print(linreg.intercept_, linreg.coef_, linreg.score(x,y))
    n = sm.add_constant(x)
    results = sm.OLS(y, n).fit()
    conf_interval = results.conf_int(0.05)
    print(results.summary())
    #Return the fitted model and coefficient intervals so they can be reused
    return results, conf_interval


#Define variables for linear regression - frequency
x_col ='season'
y_col = 'days'
x = storms_per_year[x_col]
y = storms_per_year[y_col]
x_array = np.array(x).reshape(-1,1)
y_array = np.array(y).reshape(-1,1)

linreg = LinearRegression().fit(x_array,y_array)

#Perform linear regression for frequency 
lin_reg(x_array,y_array)

#Plot
sns.set_theme(context='notebook', style='darkgrid')
sns.light_palette("#79C")

plt.scatter(x_array,y_array, alpha = 0.25)
plt.plot(x_array,linreg.predict(x_array), label='y=-0.0291x+69.2610')
plt.xlabel('Season')
plt.ylabel('Number of Storms')
plt.title('Frequency of Storms Over Time')
plt.legend()
plt.show()

I have also tried the following, which draws the confidence intervals successfully:

import matplotlib.lines as mlines
import pydove as dv
#Plot-----------------------------
#Set variables
x_col ='season'
y_col = 'days'
x = storms_per_year[x_col]
y = storms_per_year[y_col]

fig, ax = plt.subplots()
res = dv.regplot(x,y, ax=ax )
ax.set_xlabel('Season')
ax.set_ylabel('Number of Storms')
ax.set_title('Frequency of Storms Over Time')
fig.set_label(res)

reg_line = mlines.Line2D([],[])

plt.legend()
res.summary()

This results in: [image]

With this approach, however, I cannot add the statistical info to the legend as I would like. Any suggestions are welcome.

MateaMar
  • Why does your `lin_reg` function perform the same regression twice using both sklearn `LinearRegression` and statsmodels `OLS`? – tdy Jan 12 '23 at 15:37
  • Also it doesn't seem like you use your `lin_reg` function at all. You actually perform the same regression a third time outside the function using another `LinearRegression`. – tdy Jan 12 '23 at 15:38
  • I didn't realize I was being redundant, it was written in a pieced together way. Open to feedback if you have simplification suggestions. – MateaMar Jan 12 '23 at 19:53
  • It's not only redundant, but also inaccurate to report statsmodels coefficients for an sklearn model. They might not be (too) different for a simple linear regression, but it's still bad practice in general. You should perform the regression just once (either sklearn or statsmodels) and extract the relevant info from that one model. [Here is a recent example using statsmodels (just add a `label` to the regression line for your legend use case).](https://stackoverflow.com/a/75010700/13138364) – tdy Jan 12 '23 at 20:13
