I currently have a multiple regression that produces an OLS summary for life expectancy and the variables that affect it, but the summary does not include the RMSE or the standard deviation. Does statsmodels have an RMSE function, and is there a way to calculate the standard deviation from my code?

I found an earlier question on this problem (regression model statsmodel python) and read the statsmodels documentation page https://www.statsmodels.org/stable/generated/statsmodels.tools.eval_measures.rmse.html, but after testing I still have not been able to resolve it.

import pandas as pd
import statsmodels.formula.api as smf

# Load the 'States' sheet (pandas needs openpyxl installed to read .xlsx files)
df = pd.read_excel("C:/Users/File1.xlsx", sheet_name="States")

# Keep only the rows for Maine
dfME = df[df["State"] == "Maine"]

pd.set_option("display.max_columns", None)

dfME.head()

# Q("...") quotes a column name that contains a space
model = smf.ols('Q("Life Expectancy") ~ Race + Age + Weight + C(Pets)', data=dfME)
modelfit = model.fit()
modelfit.summary()
DayWalker
  • For rmse, you could use another `statsmodels` function as in my answer. What do you want to calculate the standard deviation of? – not_speshal Jul 26 '21 at 15:56
  • I am finding the life expectancy per state and looking at my code I have filtered it to the state of Maine only. I will be doing all 50 states and I need to find the standard deviation of each state. It is important for my analysis to know which states have small and larger deviations from the mean. – DayWalker Jul 26 '21 at 17:09
  • So the standard deviation of the life expectancy? – not_speshal Jul 26 '21 at 17:26

2 Answers

You could try something like this:

from statsmodels.tools.eval_measures import rmse

# Predict from the fitted results; the formula handles C(Pets) when given dfME
rmse_result = rmse(dfME["Life Expectancy"], modelfit.predict(dfME))

To get the standard deviation of life expectancy, you can simply use:

stdev = dfME["Life Expectancy"].std()
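
Since the goal is a standard deviation for each of the 50 states, a groupby may be simpler than filtering one state at a time. This is a minimal sketch, assuming the full DataFrame has the "State" and "Life Expectancy" columns from the question; the small frame below is made-up data used only for illustration.

```python
import pandas as pd

# Made-up stand-in for the spreadsheet in the question (illustration only)
df = pd.DataFrame({
    "State": ["Maine", "Maine", "Maine", "Ohio", "Ohio", "Ohio"],
    "Life Expectancy": [78.0, 80.0, 82.0, 70.0, 75.0, 80.0],
})

# Sample standard deviation (ddof=1, the pandas default) per state
std_by_state = df.groupby("State")["Life Expectancy"].std()

# Sort so the states with the largest spread come first
std_by_state = std_by_state.sort_values(ascending=False)
print(std_by_state)
```

Sorting the result makes it easy to see which states deviate most and least from their own mean.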
suvayu
not_speshal

It sounds like you mean the standard deviation of the residuals, which is closely related to the root mean squared error (RMSE). It measures how far the data points are spread around the fitted regression line and is often used as a measure of prediction error.

There is a lot of information left off the summary in Statsmodels. Fortunately, Statsmodels provides us with alternatives. You can find a list of available properties and methods here: Regression Results

Let's use the variable modelfit from your code. To find the mean squared error of the residuals, use the mse_resid attribute of the results object, documented at the link above. It is the residual sum of squares divided by the residual degrees of freedom. To get the RMSE of the residuals, take its square root with NumPy's sqrt function.

Thus the Root Mean Squared Error of the Residuals can be found using this code:

import numpy as np
rmse_residuals = np.sqrt(modelfit.mse_resid)
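
Note that this value will usually differ slightly from what statsmodels.tools.eval_measures.rmse returns, because mse_resid divides by the residual degrees of freedom while rmse takes a plain mean over all n observations. The sketch below illustrates the difference on synthetic data; the column names x1, x2, and y are hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse

# Synthetic data for illustration only; column names are hypothetical
rng = np.random.default_rng(0)
n = 50
data = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
data["y"] = 1.0 + 2.0 * data["x1"] - 0.5 * data["x2"] + rng.normal(scale=0.3, size=n)

fit = smf.ols("y ~ x1 + x2", data=data).fit()

# Residual sum of squares divided by the residual degrees of freedom (n - 3 here)
rmse_dof = np.sqrt(fit.mse_resid)

# Residual sum of squares divided by n
rmse_plain = rmse(data["y"], fit.fittedvalues)

# The same two quantities computed by hand from the residuals
ssr = np.sum(fit.resid ** 2)
by_hand_dof = np.sqrt(ssr / fit.df_resid)
by_hand_plain = np.sqrt(ssr / n)

print(rmse_dof, rmse_plain)
```

With many observations and few predictors the two numbers are close, so either is reasonable as long as the same one is used for every state.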
Aaron Horvitz
    This has been very helpful. I have searched lots of documentation and nowhere did I find .mse_resid. I will be utilizing this a lot in my upcoming analysis. – DayWalker Jul 27 '21 at 16:43