I read the relevant discussion here: Using statsmodel estimations with scikit-learn cross validation, is it possible?
In that discussion it is advised to use a wrapper for statsmodels models so that the cross_val_score function from the sklearn library can be used. The code does run, but I am not sure which arguments to provide.
Example code:
class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class  # the statsmodels.formula.api model class (e.g. logit or glm)
        self.formula = formula          # formula in patsy syntax, as required by statsmodels
        self.data = data                # the full dataframe, as required by statsmodels
        self.family = family            # the family argument, as required by statsmodels
    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()
        return self  # sklearn convention: fit returns the fitted estimator
    def predict(self, X):
        return self.results.predict(X)
formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')
Questions:
- Is it correct that I specify X=df[["workhours", "gender"]] and y=df["wage"] as arguments in cross_val_score? statsmodels just takes an entire dataframe df as input, with the model specified in patsy syntax, whereas sklearn models require separate X and y arguments.
- Does sklearn's cross_val_score also use the variable np.power(workhours, 3) in the example? I think it does, because if I reorder the formula so that np.power(workhours, 3) is the first variable after the ~, and leave out workhours so that X=df["gender"] instead of X=df[["workhours", "gender"]], the error states that the variable np.power(workhours, 3) is unknown.
- What if I change the formula to wage ~ workhours + np.power(workhours, 3) + C(gender, Treatment(reference='female'))? In statsmodels, this leaves out female so that the reference group is female. Will the sklearn model change accordingly?
- Why do I need the arguments X=None and y=None in fit?
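For context on questions 2 and 3, my understanding is that patsy builds the design matrix from the raw dataframe columns, so derived terms like np.power(workhours, 3) are computed by patsy itself and never have to appear as a column in X. A hand-built numpy sketch of the matrix I believe patsy produces for this formula (toy values and my own column labels, not patsy's exact output):

```python
import numpy as np

# Hypothetical toy values standing in for df["workhours"] and df["gender"].
workhours = np.array([38.0, 42.0, 45.0])
gender = np.array(["Male", "Female", "Male"])

# Hand-built equivalent of the patsy design matrix for
# 'workhours + np.power(workhours, 3) + C(gender)':
# an intercept, a treatment-coded gender dummy (Female as reference,
# patsy's alphabetical default), workhours, and workhours cubed.
design = np.column_stack([
    np.ones(len(workhours)),           # Intercept
    (gender == "Male").astype(float),  # C(gender)[T.Male]
    workhours,                         # workhours
    np.power(workhours, 3),            # np.power(workhours, 3)
])
print(design.shape)  # (3, 4)
```

With C(gender, Treatment(reference=...)) the only change would be which level the dummy column encodes; the matrix shape stays the same.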
Full code example:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import glm
import random
from sklearn.model_selection import cross_val_score
from sklearn.base import BaseEstimator, RegressorMixin
# generate explanatory variables
x1 = np.random.normal(40, 4, 1000)
x2 = random.choices(["Male", "Female"], k=1000)
error = np.random.normal(0, 1, 1000)
y = 1234 + (4*x1) + error
# collect data in a dataframe
df = pd.DataFrame(zip(y, x1, x2), columns=['wage', 'workhours', 'gender'])
class SMWrapper(BaseEstimator, RegressorMixin):
    """ A sklearn-style wrapper for formula-based statsmodels regressors """
    def __init__(self, model_class, formula, family, data):
        self.model_class = model_class
        self.formula = formula
        self.data = data
        self.family = family
    def fit(self, X=None, y=None):
        self.model = self.model_class(self.formula, data=self.data, family=self.family)
        self.results = self.model.fit()
        return self  # sklearn convention: fit returns the fitted estimator
    def predict(self, X):
        return self.results.predict(X)
formula = 'wage ~ workhours + np.power(workhours, 3) + C(gender)'
wrappedModel = SMWrapper(model_class=glm, formula=formula, data=df, family=sm.families.Poisson(sm.families.links.log()))
-1*cross_val_score(wrappedModel, X=df[["workhours", "gender"]], y=df["wage"], scoring="neg_mean_squared_error", cv=10, error_score='raise')
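Unrelated to the model itself, the -1* in the last line just flips sklearn's sign convention: neg_mean_squared_error reports the negated MSE so that higher scores are always better. A minimal sketch of that convention:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])

mse = np.mean((y_true - y_pred) ** 2)  # ordinary mean squared error
neg_mse = -mse                         # what a "neg_mean_squared_error" scorer reports
print(-1 * neg_mse)                    # recovers the plain MSE
```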