I have the following problem. I am working on a script that tests different ML models on a text dataset with a binary classification problem. When I use 'f1' (or 'precision', for example) as the scoring measure in GridSearchCV, I get a result for every model, but when I use 'roc_auc' I get an error. After reading this Stack Overflow post (scoring "roc_auc" value is not working with gridsearchCV appling RandomForestclassifer), I first thought the problem was that in one fold of the CV the test labels all come from a single class. However, StratifiedKFold is used, so this should not be the problem, right?
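To rule out the single-class-fold explanation, the folds can be inspected directly. The snippet below is only a sketch: the labels are hypothetical stand-ins for the real `y`, and it assumes StratifiedKFold really is the splitter in use. As far as I understand the documentation, with an integer `cv` GridSearchCV only chooses StratifiedKFold when it recognizes the estimator as a classifier and otherwise falls back to KFold, so passing a StratifiedKFold instance explicitly as `cv` removes that uncertainty.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical, imbalanced binary labels standing in for the real y.
y = np.array([1] * 15 + [0] * 85)
# Placeholder features; only y drives the stratified split.
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=5)

# Collect the set of classes that appear in each test split.
classes_per_fold = [
    set(np.unique(y[test_idx]).tolist()) for _, test_idx in cv.split(X, y)
]

for i, classes in enumerate(classes_per_fold):
    print(f"fold {i}: classes in test split = {sorted(classes)}")
```

If every fold lists both classes, a single-class test fold cannot be the cause of the 'roc_auc' failure.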
f1 output
gscv.cv_results_['mean_test_score']
Out[57]:
array([0.81858624, 0.81858624, 0.81858624, 0.81858624, 0.81858624,
       0.81858624, 0.81858624, 0.81858624, 0.81858624, 0.81858624,
       0.81858624, 0.81858624, 0.81127153, 0.82666385, 0.82666385,
       0.815     , 0.83393442, 0.82793442, 0.80629975, 0.83933983,
       0.83933983])
roc_auc output
It gives "nan" for every model, and the following errors:
AttributeError: 'ClfSwitcher' object has no attribute 'classes_'
AttributeError: 'ClfSwitcher' object has no attribute 'decision_function'
----------------------------- Python Code --------------------------------
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=LogisticRegression()):
        """
        A Custom BaseEstimator that can switch between classifiers.
        The given classifier must implement the following methods: fit, predict, predict_proba, score
        https://stackoverflow.com/questions/48507651/multiple-classification-models-in-a-scikit-pipeline-python
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    @property
    def estimator(self):
        return self._estimator

    @estimator.setter
    def estimator(self, e):
        # Reject estimators that lack any of the methods this wrapper delegates to.
        assert all(hasattr(e, x) for x in ['fit', 'predict', 'predict_proba', 'score'])
        self._estimator = e

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.9, min_df=2, max_features=50000)),
    ('clf', ClfSwitcher())
])

param_grids = [
    {
        # max_iter is given as an int; recent scikit-learn versions reject a float such as 1e3
        'clf__estimator': [LogisticRegression(max_iter=1000)],
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__estimator__C': np.logspace(-3, 3, 7)
    },
]

# With scoring='f1' this runs fine; switching to 'roc_auc' gives the nan scores and errors above.
gscv = GridSearchCV(pipeline, param_grid=param_grids,
                    scoring='f1',
                    cv=5, verbose=2, n_jobs=-1)

# uitspraak_beslissing_klein is a DataFrame loaded earlier in the script (not shown).
X = uitspraak_beslissing_klein['feiten en omstandigheden']
y = uitspraak_beslissing_klein['Werkelijk - handmatig'].map({'Onzakelijk': 1, 'Zakelijk': 0})
gscv.fit(X, y)
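Judging by the two AttributeErrors, the 'roc_auc' scorer seems to look for `classes_` (and first tries `decision_function`) on the last pipeline step, which here is the wrapper and not the wrapped classifier. Below is a minimal sketch of one possible workaround under that assumption. The class name `ClfSwitcherWithClasses` is hypothetical, the data is synthetic from `make_classification` instead of the TF-IDF text features, and inheriting from `ClassifierMixin` is my guess at what lets scikit-learn treat the wrapper as a classifier; it has not been run against the original data.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ClfSwitcherWithClasses(ClassifierMixin, BaseEstimator):
    """Hypothetical variant of ClfSwitcher that exposes classes_ after fit."""

    def __init__(self, estimator=LogisticRegression()):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        # Copy the fitted class labels onto the wrapper so that scorers
        # (and the Pipeline's classes_ property) can find them.
        self.classes_ = self.estimator.classes_
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


# Synthetic binary data standing in for the real text dataset.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_pipeline = Pipeline([('clf', ClfSwitcherWithClasses())])
demo_grid = {
    'clf__estimator': [LogisticRegression(max_iter=1000)],
    'clf__estimator__C': np.logspace(-1, 1, 3),
}

demo_gscv = GridSearchCV(demo_pipeline, param_grid=demo_grid,
                         scoring='roc_auc', cv=5)
demo_gscv.fit(X_demo, y_demo)

print(demo_gscv.cv_results_['mean_test_score'])
```

If the scores printed here are numbers and not nan, the same two changes (the `ClassifierMixin` base class and setting `classes_` in `fit`) may be worth trying on the original ClfSwitcher.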