Isolation Forest Parameter tuning with gridSearchCV

Question

I have multivariate time series data and want to detect anomalies with the Isolation Forest algorithm. I want to get the best parameters from GridSearchCV; here is the GridSearchCV code snippet.

The input data set is loaded with the snippet below.

df = pd.read_csv("train.csv")
df.drop(['dataTimestamp','Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']]  # the Anomaly column holds the labels

Define the parameters for Isolation Forest.

clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)), 'max_samples': list(range(100, 500, 5)), 'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 'max_features': [5,10,15], 'bootstrap': [True, False], 'n_jobs': [5, 10, 20, 30]}

f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid,scoring=f1sc, refit=True,cv=10, return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)

After executing the fit, I got the error below.

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

Can someone explain what this is about? I tried average='weight', but still no luck. Am I doing anything wrong here? Please also let me know how to get the F-score.

score 8 · Accepted Answer · answered May 12 '19 at 17:49

You get this error because you didn't set the average parameter when turning f1_score into a scorer. As detailed in the documentation:

average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’] This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned.

With the default average='binary', f1_score only accepts targets with two classes, so it raises this error as soon as the true and predicted labels together contain more than two distinct values. The solution is to set the average parameter of f1_score to one of the other possible values, depending on your needs. I therefore refactored the code you provided into an example of a possible solution to your problem:

from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=500, 
                                       n_classes=2)

# behaviour='new' is only needed on scikit-learn < 0.22; later releases dropped the parameter
clf = IsolationForest(random_state=47)

param_grid = {'n_estimators': list(range(100, 800, 5)), 
              'max_samples': list(range(100, 500, 5)), 
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 
              'max_features': [5,10,15], 
              'bootstrap': [True, False], 
              'n_jobs': [5, 10, 20, 30]}

# pass the metric function itself; keyword arguments are forwarded to it
f1sc = make_scorer(f1_score, average='micro')

grid_dt_estimator = model_selection.GridSearchCV(clf, 
                                                 param_grid,
                                                 scoring=f1sc, 
                                                 refit=True,
                                                 cv=10, 
                                                 return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
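
As a minimal sketch of where the original ValueError comes from: IsolationForest.predict returns +1 for inliers and -1 for outliers, so scoring it against 0/1 labels leaves three distinct values in play. The toy arrays below are made up for illustration, and the mapping assumes that 1 marks an anomaly in the original 0/1 labels.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 0 = normal, 1 = anomaly (assumed encoding).
y_true_01 = np.array([0, 0, 1, 0, 1, 0])
# Predictions in the IsolationForest convention: 1 = inlier, -1 = outlier.
y_pred = np.array([1, 1, -1, 1, 1, 1])

# The union of labels is {-1, 0, 1}, so the default average='binary' is rejected.
try:
    f1_score(y_true_01, y_pred)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Re-encode the labels to match the predictions: anomaly -> -1, normal -> 1.
y_true_pm = np.where(y_true_01 == 1, -1, 1)

# Treat the anomaly class (-1) as the positive class for the F1 score.
score = f1_score(y_true_pm, y_pred, pos_label=-1)
print(score)
```

Once the labels use the same -1/1 encoding as the predictions, the default binary average is accepted again.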

Hi Luca, thanks a lot for your response. After changing the code to f1sc = make_scorer(f1_score(average='micro')), I got the error below: TypeError: f1_score() missing 2 required positional arguments: 'y_true' and 'y_pred'. – Anantha, May 13 '19 at 05:17
The issue was resolved after labelling the data with 1 and -1 instead of 0 and 1. – Anantha, May 16 '19 at 03:56
I get the same error even after changing the labels to -1 and 1 (Counter({-1: 250, 1: 250})): TypeError: f1_score() missing 2 required positional arguments: 'y_true' and 'y_pred' – BigDataScientist, Aug 07 '19 at 18:32
What should the `scoring` parameter be for the unsupervised case of `IsolationForest`? – hafiz031, Jun 21 '21 at 07:01

score 2 · Answer 2 · answered Aug 22 '19 at 11:18

Update make_scorer with this to get it working.

make_scorer(f1_score, average='micro')
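
A small sketch of that call inside a complete search might look like the following. The synthetic data, the -1/1 label encoding, the reduced grid, and the explicit StratifiedKFold are all assumptions chosen to keep the example quick; they are not taken from the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data (assumption): 190 inliers around 0 and 10 far-away outliers.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(190, 4)),
               rng.uniform(6, 8, size=(10, 4))])
# Labels follow the IsolationForest convention: 1 = inlier, -1 = outlier.
y = np.concatenate([np.ones(190, dtype=int), -np.ones(10, dtype=int)])

# Pass the metric function itself; extra keywords are forwarded to f1_score.
f1sc = make_scorer(f1_score, average='micro')

# A deliberately tiny grid so the search finishes in seconds.
param_grid = {'n_estimators': [50, 100],
              'max_samples': [64, 128]}

# Stratify explicitly so every fold contains some outliers.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(IsolationForest(random_state=47, contamination=0.05),
                      param_grid, scoring=f1sc, cv=cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that on a two-class problem the micro-averaged F1 equals plain accuracy, so make_scorer(f1_score, pos_label=-1) may be a better fit if the score should focus on the anomaly class.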

Gayathri Manohar

Can you please help me with this? I have tried your solution but it does not work. My data is not labeled. https://stackoverflow.com/questions/58186702/using-gridsearchcv-with-isolationforest-for-finding-outliers – taga Oct 03 '19 at 09:45

score 1 · Answer 3 · answered Dec 24 '20 at 08:34

Not all of the parameters you tune are necessary.
For example:
contamination is the expected rate of anomalies; you can determine the best value after fitting a model by tuning the threshold on model.score_samples.

n_jobs is the number of CPU cores used, so it does not need tuning.
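
One way to read the contamination advice is sketched below: fit the forest once, then scan thresholds on score_samples (lower scores are more abnormal) and keep the one with the best F1 for the anomaly class. The synthetic data, the -1/1 labels, and the range of quantiles scanned are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

# Synthetic data (assumption): 190 inliers around 0 and 10 far-away outliers.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(190, 4)),
               rng.uniform(6, 8, size=(10, 4))])
y = np.concatenate([np.ones(190, dtype=int), -np.ones(10, dtype=int)])

# Fit once; contamination is not tuned here because we threshold the raw scores.
model = IsolationForest(n_estimators=100, random_state=47).fit(X)
scores = model.score_samples(X)  # the lower, the more abnormal

best_f1, best_threshold, best_rate = -1.0, None, None
# Candidate anomaly rates from 1% to 20% (arbitrary range for this sketch).
for rate in np.linspace(0.01, 0.20, 20):
    threshold = np.quantile(scores, rate)
    # Flag everything at or below the threshold as an anomaly (-1).
    y_pred = np.where(scores <= threshold, -1, 1)
    f1 = f1_score(y, y_pred, pos_label=-1)
    if f1 > best_f1:
        best_f1, best_threshold, best_rate = f1, threshold, rate

print(best_rate, best_threshold, best_f1)
```

In practice the threshold would be picked on held-out labelled data rather than on the training set used here.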

Joey Gao