I have a sample of observations with no y target value; all of the X features (predictors) are used to fit the Isolation Forest estimator. The goal is to identify which of those observations, and which of the ones that will arrive in the future, are actually outliers. For example, say I fit an array of shape (340, 3) => (n_samples, n_features) and then predict on it to identify which of the 340 observations are outliers.
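For illustration, this is roughly the workflow I have in mind (random data used only to show the shapes involved):
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic (340, 3) matrix just to illustrate the shapes
X = np.random.RandomState(0).normal(size=(340, 3))

iso = IsolationForest(random_state=123).fit(X)
labels = iso.predict(X)                  # +1 = inlier, -1 = outlier
outlier_idx = np.where(labels == -1)[0]  # indices of the flagged observations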
My approach so far is:
First I create a pipeline object
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

# single-step pipeline wrapping the Isolation Forest estimator
steps = [('IsolationForest', IsolationForest(n_jobs=-1, random_state=123))]
pipeline = Pipeline(steps)
Then I create a parameter grid for hyperparameter tuning
parameters_grid = {
    'IsolationForest__n_estimators': [25, 50, 75],
    'IsolationForest__max_samples': [0.25, 0.5, 0.75, 1.0],
    'IsolationForest__contamination': [0.01, 0.05],
    'IsolationForest__bootstrap': [True, False]
}
Finally, I apply the GridSearchCV algorithm
isolation_forest_grid = GridSearchCV(pipeline, parameters_grid, scoring=scorer_f, cv=3, verbose=2)
isolation_forest_grid.fit(scaled_x_features.values)
My goal is to find a scoring function (denoted scorer_f above) that would efficiently select the most suitable Isolation Forest estimator for outlier detection.
So far, and based on this excellent answer, my scorer is as follows:
Scorer Function
def scorer_f(estimator, X):
    # anomaly scores from the fitted estimator; lower means more anomalous
    scores = estimator.score_samples(X)
    # use the 5% quantile of the batch's scores as the outlier threshold
    thresh = np.quantile(scores, 0.05)
    # return the number of observations scored below the threshold
    return len(np.where(scores < thresh)[0])
A brief explanation: I always flag the bottom 5% (the 0.05 quantile) of score_samples values in the batch as outliers, i.e. every score below that threshold is counted as an outlier. As a result, I instruct GridSearchCV to select the model that flags the most outliers, as a worst-case scenario.
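To make the rule concrete, here is a tiny demonstration with made-up scores (not my real ones):
import numpy as np

# made-up anomaly scores; lower means more anomalous
scores = np.array([-0.62, -0.48, -0.45, -0.44, -0.43,
                   -0.41, -0.40, -0.39, -0.38, -0.37,
                   -0.36, -0.36, -0.35, -0.34, -0.34,
                   -0.33, -0.33, -0.32, -0.31, -0.30])

thresh = np.quantile(scores, 0.05)             # 5% quantile of the batch
n_outliers = len(np.where(scores < thresh)[0])
print(thresh, n_outliers)                      # only the most negative score is flagged here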
To give you a taste of the results:
isolation_forest_grid.cv_results_['mean_test_score']
array([4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. , 4. ,
4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ])
GridSearchCV selects the model at index 31 as the best one. As you can see, most of the estimators flag 4.0 outliers, so I suspect the choice among the tied candidates is essentially arbitrary.
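For reference, this is how I inspect which of the tied candidates was picked (standard GridSearchCV attributes; the index 31 is just what I got on my run):
best_idx = isolation_forest_grid.best_index_
print(best_idx)                               # 31 in my run
print(isolation_forest_grid.best_params_)     # hyperparameters of the selected candidate
print(isolation_forest_grid.cv_results_['mean_test_score'][best_idx])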
Overall, I would like to ask whether this approach is valid (mathematically sound) and can produce model estimators that are genuinely suitable for outlier detection. The drawback of outlier detection algorithms is that there is no ready-made scorer for them in the sklearn.metrics library, which is why I struggled to find a good scoring metric for GridSearchCV.