I have a concrete problem with extending the `xgb.XGBClassifier` class, but it could be framed as a general OOP question.
My implementation is based on: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py
Basically, I want to add feature-name handling when the provided data is a pandas `DataFrame`.
A few remarks:

- `XGBClassifierN` has the same parameters in `__init__` as the base class `xgb.XGBClassifier`,
- there is an additional attribute `self.feature_names` that is set later by the `fit` method,
- the rest could be done by mix-ins.
It works.
What bothers me is the wall of code in `__init__`. It was produced by copy-pasting the defaults, and it will have to be updated every time `xgb.XGBClassifier` changes.
Is there any way to concisely express the idea that the child class `XGBClassifierN` has the same parameters and defaults as the parent class `xgb.XGBClassifier`, and still be able to do things like `clf = XGBClassifierN(n_jobs=-1)`?
I've tried using only `**kwargs`, but it doesn't work out: the interpreter starts complaining that there is no `missing` parameter (no pun intended), and to make it work you basically need to set a few more parameters anyway.
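For reference, this is roughly what the `**kwargs`-only attempt looks like (a sketch; as I understand it, the failure comes from scikit-learn's `get_params()`, which introspects the `__init__` signature and therefore sees no named parameters at all):

```python
import xgboost as xgb


class XGBClassifierN(xgb.XGBClassifier):
    # Forwarding only **kwargs breaks the scikit-learn estimator API:
    # get_params() reads the __init__ signature, finds no named
    # parameters, and code that later looks up e.g. `missing` fails.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.feature_names = None
```

Here is the current, working implementation: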
```python
import xgboost as xgb


class XGBClassifierN(xgb.XGBClassifier):
    def __init__(self, base_score=0.5, booster='gbtree', colsample_bylevel=1,
                 colsample_bynode=1, colsample_bytree=1, gamma=0,
                 learning_rate=0.1, max_delta_step=0, max_depth=3,
                 min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
                 nthread=None, objective='binary:logistic', random_state=0,
                 reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                 silent=None, subsample=1, verbosity=1, **kwargs):
        super().__init__(base_score=base_score, booster=booster,
                         colsample_bylevel=colsample_bylevel,
                         colsample_bynode=colsample_bynode,
                         colsample_bytree=colsample_bytree, gamma=gamma,
                         learning_rate=learning_rate,
                         max_delta_step=max_delta_step, max_depth=max_depth,
                         min_child_weight=min_child_weight, missing=missing,
                         n_estimators=n_estimators, n_jobs=n_jobs,
                         nthread=nthread, objective=objective,
                         random_state=random_state, reg_alpha=reg_alpha,
                         reg_lambda=reg_lambda,
                         scale_pos_weight=scale_pos_weight, seed=seed,
                         silent=silent, subsample=subsample,
                         verbosity=verbosity, **kwargs)
        self.feature_names = None

    def fit(self, X, y, **kwargs):
        # Remember the DataFrame's column names before delegating.
        self.feature_names = list(X.columns)
        return super().fit(X, y, **kwargs)

    def get_feature_names(self):
        if not isinstance(self.feature_names, list):
            raise ValueError('Must fit data first!')
        return self.feature_names

    def get_feature_importances(self):
        return dict(zip(self.get_feature_names(), self.feature_importances_))
```
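And a usage sketch with made-up toy data (the column names and numbers are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 32, 47, 51],
                   'income': [40, 60, 80, 75]})
y = [0, 1, 1, 0]

clf = XGBClassifierN(n_jobs=-1)
clf.fit(df, y)
print(clf.get_feature_names())        # ['age', 'income']
print(clf.get_feature_importances())  # {'age': ..., 'income': ...}
```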