Scikit-Learn HuberRegressor: Calculating standard errors, t-statistics, p-values and confidence intervals

Question

I'm having some issues with getting the standard errors, t-statistics, p-values and 95% confidence intervals of the Huber Regression parameters in Scikit-Learn. All I can return is intercept and coefficients but nothing else.

solution = HuberRegressor()
solution.fit(beta_fac2, B)

print(solution.intercept_)
print(solution.coef_)

I tried implementing the solution from this answer:

params = np.append(solution.intercept_,solution.coef_)
predictions = solution.predict(beta_fac2)

newX = np.append(np.ones((len(beta_fac2),1)), beta_fac2, axis=1)
MSE = (sum((B-predictions)**2))/(len(newX)-len(newX[0]))
var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())
sd_b = np.sqrt(var_b)
ts_b = params/ sd_b

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX[0])))) for i in ts_b]

sd_b = np.round(sd_b,3)
ts_b = np.round(ts_b,3)
p_values = np.round(p_values,3)
params = np.round(params,4)

myDF3 = pd.DataFrame()
myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilities"] = [params,sd_b,ts_b,p_values]
print(myDF3)

but this line of code

var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal())

returns an error:

numpy.linalg.LinAlgError: Singular matrix

You might want to try using statsmodels, if you need CI, p-value etc, https://www.statsmodels.org/stable/rlm.html — StupidWolf, Nov 05 '21 at 12:52

DanGitR · Answer 1 · 2021-12-01T05:17:56.070

I also followed the answer you cited

to define a function as follows:

def lmSignificance(X, y, lm):
    from sklearn.metrics import mean_squared_error
    params=lm.coef_
# debugged by pault
pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X.reset_index(drop=True))) 
    predictions = lm.predict(X)
    features = X.columns
    MSE = mean_squared_error(y, predictions)
    var_b = MSE*(np.linalg.inv(np.dot(X.T,X)).diagonal())
    sd_b = np.sqrt(var_b)
    ts_b = params / sd_b
    #p_values =[2*(1-stats.t.cdf(np.abs(i),len(newX)-1)) for i in ts_b]
    p_values =[2*(1-stats.t.cdf(np.abs(i),len(X)-1)) for i in ts_b]
    sd_b = np.round(sd_b,3)
    ts_b = np.round(ts_b,3)
    p_values = np.round(p_values,6)
    params = np.round(params,4)
    sigDF = pd.DataFrame()
    sigDF["Features"],sigDF["Coefficients"],sigDF["Standard Errors"],sigDF["t values"],sigDF["p_values"] = \
    [features,params,sd_b,ts_b,p_values]
    return (sigDF)

Note the line debugged by 'pault' to add a column of ones and the use of mean_squared_error from scikit-learn to calculate MSE.

The function can then be called like this:

dfSigTrain = lmSignificance(X_train, y_train, solution)
print(dfSigTrain)

Where solution is your fit for the HuberRegressor Robust Linear Model (RLM).

Scikit-Learn HuberRegressor: Calculating standard errors, t-statistics, p-values and confidence intervals

1 Answers1