I read the relevant discussion here: Using statsmodel estimations with scikit-learn cross validation, is it possible?
In that discussion it is advised to use a wrapper for statsmodels models so that the cross_val_score function from the sklearn library can be used. The code does run, but I am not sure which arguments to provide.
Example code:
class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class  # the statsmodels.formula.api model class (e.g. logit or glm)
        self.formula = formula          # formula in patsy syntax, as required by statsmodels
        self.data = data                # the full dataframe, as required by statsmodels
        self.family = family            # the family argument, as required by statsmodels
    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()
        return self  # sklearn convention: fit returns the fitted estimator
    def predict(self, X):
        return self.results.predict(X)
formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')
Questions:
- Is it correct that I specify X=df[["workhours", "gender"]] and y=df["wage"] as arguments in cross_val_score? statsmodels just takes an entire dataframe df as input, with the model specified in patsy syntax, whereas sklearn models require separate X and y arguments.
- Does sklearn's cross_val_score also use the variable np.power(workhours, 3) in the example? I think it does, because if I reorder the formula so that np.power(workhours, 3) is the first variable after the ~, and leave out workhours so that X=df["gender"] instead of X=df[["workhours", "gender"]], the error states that the variable np.power(workhours, 3) is unknown.
- What if I change the formula to wage ~ workhours + np.power(workhours, 3) + C(gender, Treatment(reference='female'))? In statsmodels, this leaves out female so that the reference group is female. Will the sklearn model change accordingly?
- Why do I need the arguments X=None and y=None in fit?
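For context on questions 2 and 3, my understanding is that patsy builds the design matrix from the raw dataframe columns, so derived terms like np.power(workhours, 3) are computed by patsy itself and never have to appear as a column in X. A hand-built numpy sketch of the matrix I believe patsy produces for this formula (toy values and my own column labels, not patsy's exact output):

```python
import numpy as np

# Hypothetical toy values standing in for df["workhours"] and df["gender"].
workhours = np.array([38.0, 42.0, 45.0])
gender = np.array(["Male", "Female", "Male"])

# Hand-built equivalent of the patsy design matrix for
# 'workhours + np.power(workhours, 3) + C(gender)':
# an intercept, a treatment-coded gender dummy (Female as reference,
# patsy's alphabetical default), workhours, and workhours cubed.
design = np.column_stack([
    np.ones(len(workhours)),           # Intercept
    (gender == "Male").astype(float),  # C(gender)[T.Male]
    workhours,                         # workhours
    np.power(workhours, 3),            # np.power(workhours, 3)
])
print(design.shape)  # (3, 4)
```

With C(gender, Treatment(reference=...)) the only change would be which level the dummy column encodes; the matrix shape stays the same.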
Full code example:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import glm
import random
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator, RegressorMixin
# generate explanatory variables
x1 = np.random.normal(40, 4, 1000)
x2 = random.choices(["Male", "Female"], k=1000)
error = np.random.normal(0, 1, 1000)
y = 1234 + (4*x1) + error
# collect data in a dataframe
df = pd.DataFrame(zip(y, x1, x2), columns=['wage', 'workhours', 'gender'])
class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class
        self.formula = formula
        self.data = data
        self.family = family
    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()
        return self  # sklearn convention: fit returns the fitted estimator
    def predict(self, X):
        return self.results.predict(X)
formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')
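Unrelated to the model itself, the -1* in the last line just flips sklearn's sign convention: neg_mean_squared_error reports the negated MSE so that higher scores are always better. A minimal sketch of that convention:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])

mse = np.mean((y_true - y_pred) ** 2)  # ordinary mean squared error
neg_mse = -mse                         # what a "neg_mean_squared_error" scorer reports
print(-1 * neg_mse)                    # recovers the plain MSE
```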