
I have a sample of observations without a y target value; all of the X features (predictors) are used to fit the Isolation Forest estimator. The goal is to identify which of those observations, and which of the ones to come in the future, are outliers. For example, say I fit an array of shape (340, 3) => (n_samples, n_features) and then predict on it to identify which of the 340 observations are outliers.
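To make the setup concrete, here is a minimal sketch of what I mean, with a random (340, 3) array standing in for my real features:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(123)
X = rng.normal(size=(340, 3))          # stand-in for my real (n_samples, n_features) data

iso = IsolationForest(n_jobs=-1, random_state=123)
iso.fit(X)                             # unsupervised fit: X only, no y
labels = iso.predict(X)                # +1 = inlier, -1 = outlier
outlier_idx = np.where(labels == -1)[0]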

My approach so far is:

First, I create a pipeline object:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

steps = [('IsolationForest', IsolationForest(n_jobs=-1, random_state=123))]
pipeline = Pipeline(steps)
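(As a quick sanity check, the keys of the grid below must use the pipeline step name as a prefix; listing the pipeline's parameters confirms which names are available:)

print(pipeline.get_params().keys())
# ... includes 'IsolationForest__n_estimators', 'IsolationForest__max_samples', etc.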

Then I create a parameter grid for hyperparameter tuning:

parameters_grid = {'IsolationForest__n_estimators': [25, 50, 75],
                   'IsolationForest__max_samples': [0.25, 0.5, 0.75, 1.0],
                   'IsolationForest__contamination': [0.01, 0.05],
                   'IsolationForest__bootstrap': [True, False]
                  }

Finally, I apply GridSearchCV:

isolation_forest_grid = GridSearchCV(pipeline, parameters_grid, scoring=scorer_f, cv=3, verbose=2)
isolation_forest_grid.fit(scaled_x_features.values)

My goal is to find a scoring function (denoted scorer_f below) that would efficiently select the most suitable Isolation Forest estimator for outlier detection.

So far, and based on this excellent answer, my scorer is as follows:

Scorer Function

import numpy as np

def scorer_f(estimator, X):
    scores = estimator.score_samples(X)         # lower score = more abnormal
    thresh = np.quantile(scores, 0.05)          # 5% quantile of the scores as the outlier threshold
    return len(np.where(scores < thresh)[0])    # number of observations flagged as outliers

A brief explanation: I always flag the bottom 5% (the 0.05 quantile) of scores in each batch as outliers, so every score below the threshold counts as an outlier. As a result, I instruct GridSearchCV to select the model that flags the most outliers, as a worst-case scenario.
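To make the scorer's behaviour concrete, here is a quick check on synthetic scores (random numbers standing in for the output of score_samples, not my real data):

import numpy as np

rng = np.random.RandomState(0)
fake_scores = rng.normal(size=340)                  # stand-in for estimator.score_samples(X)
thresh = np.quantile(fake_scores, 0.05)
n_flagged = len(np.where(fake_scores < thresh)[0])
print(n_flagged, n_flagged / len(fake_scores))      # 17 0.05 -> ~5% of the batch by construction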

To give you a taste of the results:

isolation_forest_grid.cv_results_['mean_test_score']

array([4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 3.8, 4. , 4. ,
       4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. , 4. ])

GridSearchCV selects the model at index 31 as the best model. As you can see, most of the candidate estimators flag 4.0 outliers on average, so I suspect the final choice among the tied candidates is effectively arbitrary.
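For completeness, this is how I pull out the winning candidate afterwards (best_index_, best_params_ and best_estimator_ are the standard GridSearchCV attributes; scaled_x_features is my scaled feature DataFrame from above):

print(isolation_forest_grid.best_index_)                      # 31 in my run
print(isolation_forest_grid.best_params_)
best_model = isolation_forest_grid.best_estimator_            # pipeline refitted with the best params
outlier_flags = best_model.predict(scaled_x_features.values)  # -1 marks the outliers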

Overall, I would like to ask whether this approach is valid (mathematically correct) and can produce valid model estimators for outlier detection. The drawback of outlier detection algorithms is that sklearn.metrics offers no ready-made scorer for them, which is why I struggled to find a good scoring metric for GridSearchCV.

NikSp
  • "Overall, I would like to ask if this approach is valid (mathematically correct) and can produce valid model estimators for outlier detection." is not a programming question, and so off-topic here. – Ben Reiniger Jun 03 '22 at 13:31
  • @BenReiniger I would like to point out that *mathematically* refers to the Python code behind the mathematical representation of the scorer function. – NikSp Jun 03 '22 at 13:45
  • Anyway, your scorer sets 5% of the data as outliers, with the only differing scores apparently coming from ties. I doubt that's a particularly useful metric. – Ben Reiniger Jun 03 '22 at 23:42
  • @BenReiniger Do you know any other more useful metric that could help me better assess the model estimators during the GridSearch? – NikSp Jun 04 '22 at 06:46
  • I do not; I have some ideas, but don't know if there are any established ones. That would make a good question over at stats.SE or datascience.SE. – Ben Reiniger Jun 04 '22 at 14:37
  • @BenReiniger I posted the question on datascience.SE https://datascience.stackexchange.com/questions/111597/evaluate-multiple-isolation-forest-estimators-during-gridsearchcv-with-custom-sc – NikSp Jun 06 '22 at 07:34
  • @BenReiniger If you are familiar with Python coding, you can write up your ideas and we can flesh them out if a good one arises. – NikSp Jun 07 '22 at 07:29
  • Any update on this issue? – NikSp Jul 07 '22 at 09:33

0 Answers