
Based on that topic, I created a wrapper around statsmodels' GLM so I can use scikit-learn's cross_val_score. Now I need to introduce variance (analytic) weights via the var_weights parameter of sm.GLM.

class Wrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, alpha, L1_wt, var_weights):
        self.family = family
        self.alpha = alpha
        self.L1_wt = L1_wt
        self.var_weights = var_weights

    def fit(self, X, y):
        self.model = sm.GLM(endog=y, exog=X, family=self.family,
                            var_weights=self.var_weights)
        self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt)
        return self.result

    def predict(self, X):
        return self.result.predict(X)

The wrapper lets me run this successfully:

sm_glm = Wrapper(family, alpha, L1_wt, var_weights)
sm_glm.fit(x, y)

But the cross validation

cross_val_score(sm_glm, x, y, cv=cv, scoring=scoring)

doesn't work: cross_val_score slices x and y according to the cv folds, but not var_weights, which leads to an error:

ValueError: var weights not the same length as endog

The way I see it, I need to track cross_val_score's iterations dynamically and trim var_weights to match each fold.

Any ideas how to create a workaround for that?

desertnaut
Ostio

1 Answer


I think your guess is correct: cross_val_score does not propagate the weights array into the cross-validation loop, so it is not split into the same k folds as X and y. I replicated your error with other data (inspired by the statsmodels documentation on the difference between var_weights and freq_weights). Reading through the traceback, it appears that a per-sample parameter such as var_weights is only picked up for cross-validation if it is passed through the fit() method implemented by your custom estimator.

I went around it in the following way:

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMW(BaseEstimator, RegressorMixin):
    def __init__(self, family, alpha, L1_wt):
        self.family = family
        self.alpha = alpha
        self.L1_wt = L1_wt

    def fit(self, X, y, var_weights=None):
        self.model = sm.GLM(endog=y, exog=X, family=self.family,
                            var_weights=var_weights)
        self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt)
        return self  # scikit-learn expects fit() to return the estimator itself

    def predict(self, X):
        return self.result.predict(X)

In a way, var_weights is really part of the data rather than a hyperparameter of the estimator like alpha or L1_wt. So I think it is more cohesive to pass the array to the fit method instead of into the class constructor.

So, when you actually want to cross-validate, you can run:

cross_val_score(sm_glm, X, y, scoring=scoring, fit_params={'var_weights': var_weights})

and cross_val_score will slice var_weights along with X and y for each fold and pass it into your custom .fit() method.

tania