Find RMSE and Standard Deviation of a StatsModels Multiple Regression

Question

I currently have a multiple regression that generates an OLS summary of life expectancy and the variables that affect it. However, the summary does not include RMSE or standard deviation. Does statsmodels have an RMSE function, and is there a way to calculate the standard deviation from my code?

I found a previous example of this problem (regression model statsmodel python) and read the statsmodels documentation page: https://www.statsmodels.org/stable/generated/statsmodels.tools.eval_measures.rmse.html, but after testing I am still not able to resolve it.

import pandas as pd
import openpyxl  # engine pandas uses to read .xlsx files
import statsmodels.formula.api as smf

df = pd.read_excel("C:/Users/File1.xlsx", sheet_name="States")

# Keep only the rows for Maine
dfME = df[df["State"] == "Maine"]

pd.set_option("display.max_columns", None)

dfME.head()

# Q("...") quotes a column name that contains a space
model = smf.ols('Q("Life Expectancy") ~ Race + Age + Weight + C(Pets)', data=dfME)
modelfit = model.fit()
modelfit.summary()

For rmse, you could use another `statsmodels` function as in my answer. What do you want to calculate the standard deviation of? — not_speshal, Jul 26 '21 at 15:56
I am finding the life expectancy per state and looking at my code I have filtered it to the state of Maine only. I will be doing all 50 states and I need to find the standard deviation of each state. It is important for my analysis to know which states have small and larger deviations from the mean. — DayWalker, Jul 26 '21 at 17:09

score 0 · Answer 1 · edited Jul 26 '21 at 19:20

You could try something like this:

from statsmodels.tools.eval_measures import rmse

# Predict from the fitted results (modelfit); the formula rebuilds C(Pets) from the Pets column
predictions = modelfit.predict(dfME)
rmse_result = rmse(dfME["Life Expectancy"], predictions)

To get the standard deviation of life expectancy, you can simply use:

stdev = dfME["Life Expectancy"].std()
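
To get the standard deviation for every state at once rather than filtering one state at a time, a pandas groupby is one option. The following is only a sketch on made-up numbers: the toy DataFrame is hypothetical and stands in for the real 'States' sheet, and only the column names "State" and "Life Expectancy" are taken from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 'States' sheet
df = pd.DataFrame({
    "State": ["Maine", "Maine", "Maine", "Ohio", "Ohio", "Ohio"],
    "Life Expectancy": [78.0, 80.0, 82.0, 70.0, 75.0, 80.0],
})

# Sample standard deviation (ddof=1, the pandas default) for each state
std_by_state = df.groupby("State")["Life Expectancy"].std()

# Sort so the states with the smallest spread come first
print(std_by_state.sort_values())
```

The result is a Series indexed by state, so a single state can be looked up with std_by_state["Maine"].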

answered Jul 26 '21 at 15:54 by not_speshal · edited Jul 26 '21 at 19:20 by suvayu

This code is giving me an error: ValueError: shapes (1,4) and (2,6) not aligned: 4 (dim 1) != 2 (dim 0) – DayWalker Jul 26 '21 at 17:19
@DayWalker - See my edit. You probably have `y` as another variable. – not_speshal Jul 26 '21 at 17:26
Is it possible to put this into a for loop to generate all 50 states standard dev using the code above? – DayWalker Jul 27 '21 at 16:42

Aaron Horvitz · Accepted Answer · 2021-07-26T18:56:50.913

It sounds like you mean the standard deviation of the residuals, which is the square root of the residual mean squared error. It measures how far the data points spread around the model's fitted values, and it is often used as a measure of prediction error.

The statsmodels summary leaves out a lot of information. Fortunately, the fitted results object exposes it through other properties and methods. You can find a list of them here: Regression Results

Let's use the variable modelfit from your code. The mean squared error of the residuals is available as the mse_resid attribute of the fitted results, listed at the link. Note that mse_resid divides the sum of squared residuals by the residual degrees of freedom rather than by the number of observations. To find the RMSE (root mean squared error) of the residuals, take the square root of mse_resid using NumPy's sqrt function.

Thus the Root Mean Squared Error of the Residuals can be found using this code:

import numpy as np
rmse_residuals = np.sqrt(modelfit.mse_resid)
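
The value above is close to, but not identical with, what statsmodels.tools.eval_measures.rmse returns on the fitted values, because the two use different denominators. The sketch below illustrates this on synthetic data; the DataFrame toy, its columns x1, x2 and y, and the coefficients used to generate them are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse

# Synthetic data (hypothetical): y depends linearly on x1 and x2 plus noise
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(scale=0.3, size=n)

fit = smf.ols("y ~ x1 + x2", data=toy).fit()

# RMSE over the observations: sqrt(SSR / n)
rmse_n = rmse(toy["y"], fit.fittedvalues)

# Residual standard error: sqrt(SSR / df_resid), where df_resid = n - 3 here
resid_se = np.sqrt(fit.mse_resid)

print(rmse_n, resid_se)
```

With many observations and few predictors the two numbers are nearly equal; the residual standard error is always the slightly larger one because it divides by fewer degrees of freedom.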

This has been very helpful. I have searched lots of documentation and nowhere did I find .mse_resid. I will be utilizing this a lot in my upcoming analysis. — DayWalker, Jul 27 '21 at 16:43
