I have the following code, which works fine, but I got a

UserWarning: One or more of the test scores are non-finite: [nan nan]
  category=UserWarning

when I revised it into a more concise version (shown in the second code snippet below). Is the output of the one-hot encoder the culprit of the issue?

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import RidgeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV

train = pd.read_csv('/train.csv')
test = pd.read_csv('/test.csv')
# Columns whose names start with 'cat' are categorical; the rest, except the target, are numeric
sparse_features = [col for col in train.columns if col.startswith('cat')]
dense_features = [col for col in train.columns if col not in sparse_features + ['target']]
X = train.drop(['target'], axis=1)
y = train['target'].values
skf = StratifiedKFold(n_splits=5)
clf = RidgeClassifier()

full_pipeline = ColumnTransformer(transformers=[
    ('num', StandardScaler(), dense_features),
    ('cat', OneHotEncoder(), sparse_features)
])
X_prepared = full_pipeline.fit_transform(X)
param_grid = {
    'alpha': [0.1],
    'fit_intercept': [False]
}
gs = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=skf
)
gs.fit(X_prepared, y)

The revision is shown below.

from sklearn.pipeline import Pipeline

clf2 = RidgeClassifier()
preprocess_pipeline2 = ColumnTransformer([
    ('num', StandardScaler(), dense_features),
    ('cat', OneHotEncoder(), sparse_features)
])
final_pipeline = Pipeline(steps=[
    ('p', preprocess_pipeline2),
    ('c', clf2)
])
param_grid2 = {
    'c__alpha': [0.4, 0.1],
    'c__fit_intercept': [False]
}
gs2 = GridSearchCV(
    estimator=final_pipeline,
    param_grid=param_grid2,
    scoring='roc_auc',
    n_jobs=-1,
    cv=skf
)
gs2.fit(X, y)

Can anyone point out what goes wrong?

EDIT: After setting error_score to 'raise', I get more informative feedback about the issue (the traceback is below; a sketch of the handle_unknown='ignore' variant follows it). It seems to me that I need to fit the one-hot encoder on the merged dataset that combines the training set and the test set. Am I correct? But if that is the case, why doesn't the first version complain about the same issue? By the way, does it make sense to pass the argument handle_unknown='ignore' to handle this issue?

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 620, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 200, in __call__
    sample_weight=sample_weight)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 334, in _score
    y_pred = method_caller(clf, "decision_function", X)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 53, in _cached_call
    return getattr(estimator, method)(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py", line 493, in decision_function
    Xt = transform.transform(Xt)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 565, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 444, in _fit_transform
    self._iter(fitted=fitted, replace_strings=True), 1))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 1044, in __call__
    while self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
    for func, args, kwargs in self.items]
  File "/opt/conda/lib/python3.7/site-packages/sklearn/utils/fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py", line 733, in _transform_one
    res = transformer.transform(X)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 462, in transform
    force_all_finite='allow-nan')
  File "/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 136, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-48-b81f3b7b0724> in <module>
     21     cv=skf
     22 )
---> 23 gs2.fit(X, y)

/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    839                 return results
    840 
--> 841             self._run_search(evaluate_candidates)
    842 
    843             # multimetric is determined here because in the case of a callable

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1286     def _run_search(self, evaluate_candidates):
   1287         """Search all candidates in param_grid"""
-> 1288         evaluate_candidates(ParameterGrid(self.param_grid))
   1289 
   1290 

/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
    807                                    (split_idx, (train, test)) in product(
    808                                    enumerate(candidate_params),
--> 809                                    enumerate(cv.split(X, y, groups))))
    810 
    811                 if len(out) < 1:

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052 
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform
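For reference, a minimal sketch of the handle_unknown='ignore' idea from the EDIT above; relative to the revised snippet, only the encoder argument changes:

preprocess_pipeline2 = ColumnTransformer([
    ('num', StandardScaler(), dense_features),
    # Categories that appear only in a validation fold are then encoded
    # as all-zero rows instead of raising ValueError during transform.
    ('cat', OneHotEncoder(handle_unknown='ignore'), sparse_features)
])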
Li-Pin Juan

4 Answers


First, I'd like to say that I had a similar issue, and thanks for noting

After setting error_score to raise

which really helped me with my own problem. I was using a custom transformer, and some of my code created variables in the training fold but then failed to create them in the validation fold, because those categories were not present in validation. I think you're having a similar problem.

It looks like OneHotEncoder learns the categories present in your training fold and then encounters new categories in your validation fold that are unknown to it, because they didn't exist in the training fold.

ValueError: Found unknown categories ['MR', 'MW', 'DA'] in column 10 during transform
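Here is a tiny illustration of that mechanism with made-up data (the column name and values are hypothetical):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# 'MR' appears only in the "validation" frame, never during fit.
train_fold = pd.DataFrame({'cat10': ['AA', 'BB', 'AA']})
valid_fold = pd.DataFrame({'cat10': ['AA', 'MR']})

enc = OneHotEncoder().fit(train_fold)
# enc.transform(valid_fold)  # would raise: Found unknown categories ['MR'] in column 0

# With handle_unknown='ignore', the unseen value becomes an all-zero row:
enc2 = OneHotEncoder(handle_unknown='ignore').fit(train_fold)
print(enc2.transform(valid_fold).toarray())  # [[1. 0.], [0. 0.]]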

To get around this, my suggestion would be to look into writing a custom transformer, since your data is more complex (a minimal skeleton follows): https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
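If you go that route, a skeleton might look like the sketch below. The class name and the pandas.get_dummies approach are made up for illustration, not taken from the article:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FoldSafeOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one-hot encode with pandas, tolerating unseen categories."""

    def fit(self, X, y=None):
        # Remember the dummy columns produced on the training fold.
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        # Unseen categories yield columns not in self.columns_ and are dropped;
        # categories missing from X come back as all-zero columns.
        return pd.get_dummies(X).reindex(columns=self.columns_, fill_value=0)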

yeamusic21

Remove roc_auc if your target is multiclass; they do not play well together. Use the default scoring or choose something else.
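For instance, keeping everything else from the question and simply dropping the scorer (so GridSearchCV falls back to the classifier's own .score, i.e. mean accuracy), a sketch:

gs2 = GridSearchCV(
    estimator=final_pipeline,
    param_grid=param_grid2,
    n_jobs=-1,
    cv=skf  # default scoring: the estimator's .score (accuracy)
)
gs2.fit(X, y)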

John K

I stumbled here because I got the same warning message while trying to do multiclass classification. My problem was that I tried to use scoring='roc_auc' with GridSearchCV, which doesn't work with multiclass. I used scoring='f1_micro' instead, which worked fine.

F1 scoring is discussed, e.g., in How to do GridSearchCV for F1-score in classification problem with scikit-learn? A list of the different scoring options can be found in section 3.3.1 here: https://scikit-learn.org/stable/modules/model_evaluation.html
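Applied to the revised pipeline from the question, that change would look something like this sketch:

gs2 = GridSearchCV(
    estimator=final_pipeline,
    param_grid=param_grid2,
    scoring='f1_micro',  # works for both binary and multiclass targets
    n_jobs=-1,
    cv=skf
)
gs2.fit(X, y)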

morko

I also encountered a similar issue, and I have a suggestion that might help you. Instead of including transformers in your pipeline that don't require parameter tuning, it's better to only pass objects whose parameters you want to optimize when using GridSearchCV.

In your case, it seems unlikely that you would need to tune the OneHotEncoder. Therefore, my recommendation is to first apply each step of your pipeline separately to the dataset and obtain the processed dataset.

Afterward, you can safely use GridSearchCV to tune the model by specifying the relevant parameters to optimize. You can retrieve the optimal parameters from the best_params_ attribute, plug them back into your pipeline, and continue with whatever you intend to do with it.
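A sketch of that workflow, reusing the names from the question (note this mirrors the question's first snippet, so the preprocessing is fit once on all of X):

# Preprocess once, outside the search.
X_prepared = preprocess_pipeline2.fit_transform(X)

# Tune only the model on the prepared data.
gs = GridSearchCV(
    estimator=RidgeClassifier(),
    param_grid={'alpha': [0.4, 0.1], 'fit_intercept': [False]},
    cv=skf
)
gs.fit(X_prepared, y)

# Plug the winning parameters back into the full pipeline.
final_pipeline.set_params(
    c__alpha=gs.best_params_['alpha'],
    c__fit_intercept=gs.best_params_['fit_intercept']
)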

Destroy666