
I have datasets with some outliers. With a simple linear regression, using

from scipy import stats
stat_lin = stats.linregress(X, Y)

I can get the slope, intercept, r_value, p_value, and std_err.

But I want to apply a robust regression method, as I don't want the outliers to influence the fit.

So I applied the Huber regressor from sklearn:

from sklearn import linear_model
huber = linear_model.HuberRegressor(alpha=0.0, epsilon=1.35)
# X must be 2-D with shape (n_samples, 1); y should stay 1-D
huber.fit(mn_all_df['X'].to_numpy().reshape(-1, 1), mn_all_df['Y'].to_numpy())

From that, I can get the coefficient, intercept, scale, and outliers.

I am happy with the result: the coefficient is higher and the regression line fits the majority of the data points.

However, I need values such as the r value and p value to show that the results from the Huber regressor are significant.

How can I get the r value and p value from a robust regression (in my case, the Huber regressor)?

Dong-gyun Kim

2 Answers


HuberRegressor comes from sklearn, whose linear_model module does not provide r_value or p_value. There are other answers that calculate these values from the results of a regression.

In this answer someone shows how the p_values of a linear regression can be calculated. I think this can also be applied with your model.

Edit: I looked into the r value; squaring it gives the r squared value. The following snippet is from the scipy documentation:

print(f"R-squared: {res.rvalue**2:.6f}")
R-squared: 0.717533

If you have your own regression, you can use this sklearn function to calculate the r squared value: sklearn.metrics.r2_score(y_true, y_pred).
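As a rough sketch of that idea, the snippet below fits a HuberRegressor and scores its predictions with r2_score. The data are synthetic and purely illustrative (the slope, noise level, and injected outliers are my assumptions, not the asker's data). Restricting the score to the points the model did not flag via outliers_ is only one possible convention, not an established definition of a robust r squared.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import r2_score

# Synthetic data (assumption): a linear trend plus noise and a few outliers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
y[::10] += 15.0  # push every tenth point far above the line

X = x.reshape(-1, 1)  # sklearn expects a 2-D feature matrix
huber = HuberRegressor(alpha=0.0, epsilon=1.35).fit(X, y)
y_pred = huber.predict(X)

# r squared over all points, outliers included
r2_all = r2_score(y, y_pred)

# r squared over the points the model did not flag as outliers
inliers = ~huber.outliers_
r2_inliers = r2_score(y[inliers], y_pred[inliers])

print(f"slope={huber.coef_[0]:.3f}  intercept={huber.intercept_:.3f}")
print(f"R^2 (all points)={r2_all:.3f}  R^2 (inliers only)={r2_inliers:.3f}")
```

The all-points score is dragged down by the outliers that the fit deliberately ignores, so it says little about how well the line describes the bulk of the data.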

JANO

You can also use robust linear models in statsmodels. For example:

import statsmodels.api as sm
from sklearn import datasets

iris = datasets.load_iris()  # load the iris dataset
x = iris.data[:, 0]  # sepal length
y = iris.data[:, 2]  # petal length
rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()

The p value you get from scipy.stats.linregress tests the null hypothesis that the slope is zero. You can get the equivalent from the robust model with:

rlm_results.summary()
                     
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
x1             1.8648      0.091     20.434      0.000       1.686       2.044
==============================================================================

The r_value from linregress is the Pearson correlation coefficient between x and y; it does not depend on the fitted line, so it stays the same. A robust linear model weights the observations differently to make the fit less sensitive to outliers, so the ordinary r squared calculation is not very meaningful here. You might get a lower r squared, since the line is no longer pulled towards the outlying data points.

See the comments by @Josef (who maintains statsmodels) on this question and this answer. You can try the calculation from the following question if you would like a meaningful r-squared:

How to get R-squared for robust regression (RLM) in Statsmodels?

StupidWolf