6

I am using the rlm function from MASS to perform a robust regression. Unlike with lm, the summary function does not return a value for R-squared.

Is it therefore appropriate to calculate this using `1 - sum(residual^2) / sum((Y - mean(Y))^2)`?

(Apologies for the equation; I could not figure out how to write it in a better-formatted way.)
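Concretely, the calculation I have in mind looks like the sketch below; the data are just a stand-in so the snippet runs (my real data and model are different):

    library(MASS)

    ## Stand-in data so the snippet runs; my real data is different.
    set.seed(123)
    d <- data.frame(x = 1:30)
    d$Y <- 5 + 2 * d$x + rnorm(30, sd = 3)

    fit <- rlm(Y ~ x, data = d)

    ## The R-squared I am proposing to compute by hand:
    1 - sum(residuals(fit)^2) / sum((d$Y - mean(d$Y))^2)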

Electromagnet
    This question already has an answer on the stats stackexchange site [here](https://stats.stackexchange.com/questions/83826/is-a-weighted-r2-in-robust-linear-model-meaningful-for-goodness-of-fit-analys) – Allan Cameron Feb 05 '20 at 10:45
  • Thanks for responding. Appreciate your help in pointing me in the right direction – Electromagnet Feb 05 '20 at 23:33

2 Answers

8

Robust linear regression (rlm) is used instead of linear regression (lm) when the data contain many outliers; it can also be used to detect influential observations. Robust regression uses iteratively reweighted least squares (IRLS) for M-estimation (a generalisation of maximum likelihood estimation), whereas linear regression uses ordinary least squares (OLS), which is why lm() returns R-squared (the coefficient of determination) and rlm() does not.
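To make the difference concrete, here is a minimal simulated sketch (the data are invented for this answer; nothing is taken from the question):

    library(MASS)

    set.seed(42)
    x <- 1:100
    y <- 3 * x + rnorm(100, sd = 10)
    y[c(91, 95, 99)] <- y[c(91, 95, 99)] + 500   # a few heavy outliers

    fit_ols  <- lm(y ~ x)    # ordinary least squares
    fit_irls <- rlm(y ~ x)   # IRLS / M-estimation

    coef(fit_ols)                # slope pulled away from 3 by the outliers
    coef(fit_irls)               # outliers are downweighted, slope stays near 3

    summary(fit_ols)$r.squared   # lm's summary provides R-squared
    summary(fit_irls)            # rlm's summary has no R-squared entry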

Now, coming to the appropriateness: R-squared is not an appropriate measure of fit for robust regression, because its formula is built on the squared loss, sum(residual^2) = sum((observed - predicted)^2). Since robust regression is used precisely when the data contain many outliers, this metric can produce absurd values: the residuals at the outlying points are large, and squaring them makes them dominate the sum.
This is why an absolute loss, |observed - predicted|, is used to assess fit when outliers are involved.
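To see the effect, continuing the simulated sketch above (reusing `y` and `fit_irls` from it):

    ## Naive R-squared for the robust fit: the handful of huge squared
    ## residuals dominate both sums and distort the value.
    res <- residuals(fit_irls)
    1 - sum(res^2) / sum((y - mean(y))^2)

    ## Absolute-loss summaries of the same residuals are far less
    ## sensitive to the outliers.
    mean(abs(res))     # mean absolute error
    median(abs(res))   # median absolute error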

Hope this helps.

Nizam
1

One interpretation of what R² tells us about a regression model is how much better the model is than a simple statistical average. Since averages are not robust, we can potentially reformulate a robust R² using medians (in the same way we use the median absolute deviation as a robust measure of variability instead of the standard deviation). If we do this, we get similar R² values for well-behaved data and a better goodness-of-fit metric for robust models:

import numpy as np


def robust_r2_score(obs, exps):
    """
    Formulation of a robust r2. In this case, we have a metric that tells us how
    much better our model is than a median of our observations, using median
    squared deviation rather than variance.

    Args:
        obs: a sequence of observations
        exps: a sequence of expected (predicted) values from a model
    """
    obs = np.asarray(obs)
    exps = np.asarray(exps)
    # median squared deviation of the observations about their median (total)
    msd_tot = np.median((obs - np.median(obs)) ** 2)
    # median squared deviation of the residuals (error)
    msd_res = np.median((obs - exps) ** 2)
    return 1 - msd_res / msd_tot

Just a thought. I am wondering whether others have tried something similar in practice? I use a similar pattern in this paper.

philometer