
I am trying to build an outlier detector to find outliers in test data. The data varies a bit from run to run (more test channels, longer testing).

First I apply the train/test split, because I want to use grid search on the training data to get the best results. This is time-series data from multiple sensors, and I removed the time column beforehand.

X shape : (25433, 17)
y shape : (25433, 1)

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=0)

I standardize afterwards and then convert the arrays to int, because GridSearch doesn't seem to like continuous data. This can surely be done better, but I want it to work before I optimize the code.

# X
mean = StandardScaler().fit(X_train)
X_train = mean.transform(X_train)
X_test = mean.transform(X_test)

X_train = np.round(X_train,2)*100
X_train = X_train.astype(int)
X_test = np.round(X_test,2)*100
X_test = X_test.astype(int)

# y
yeah = StandardScaler().fit(y_train)
y_train = yeah.transform(y_train)
y_test = yeah.transform(y_test)
y_train = np.round(y_train,2)*100
y_train = y_train.astype(int)
y_test = np.round(y_test,2)*100
y_test = y_test.astype(int)

I chose IsolationForest because it's fast, gives pretty good results, and can handle huge data sets (I currently only use a chunk of the data for testing). SVM might also be an option I want to check out. Then I set up the GridSearchCV:

clf = IForest(random_state=47, behaviour='new',
              n_jobs=-1)

param_grid = {'n_estimators': [20,40,70,100], 
              'max_samples': [10,20,40,60], 
              'contamination': [0.1, 0.01, 0.001], 
              'max_features': [5,15,30], 
              'bootstrap': [True, False]}

fbeta = make_scorer(fbeta_score,
                    average = 'micro',
                    needs_proba=True,
                    beta=1)

grid_estimator = model_selection.GridSearchCV(clf, 
                                              param_grid,
                                              scoring=fbeta,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)

grid_estimator.fit(X_train, y_train)

The Problem:

GridSearchCV needs a y argument, so I think this only works with supervised learning? If I run this I get the following error, which I don't understand:

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
  • What is the type of `y_train` and the type of `clf.predict`? Are they compatible with each other? – Kota Mori Sep 27 '22 at 13:07
  • `y_train` is a 2D array of int32 and `clf.predict` is a method of the IForest. These should definitely work together, as I already used the IForest without GridSearchCV. – arooki Sep 28 '22 at 08:08
  • Okay. You should provide a reproducible example. Currently the code is incomplete: `X` and `y` are not given and the import lines are missing. – Kota Mori Sep 28 '22 at 10:10
  • We need a bit more information. You say you're doing unsupervised learning, but you have targets `y`, which are continuous. You try to use Fbeta, which is a (hard) classification metric, and you try to pass it probability scores. What are you actually trying to accomplish, and how do you measure success? – Ben Reiniger Sep 29 '22 at 00:51
  • I'm not allowed to make the data public... I'll try to provide as much info as possible. The data is float, multimodal, and ranges between -0.8 and 40,000. I used the y target because GridSearch would throw a missing y_true label error at me. That's why I'm asking if GridSearch can only be used for supervised learning. – arooki Sep 29 '22 at 10:15
  • Because I think GridSearch needs a target (y_true), I chose y to be a single column of the data (battery state of charge) and X represents the rest of the data. I'm not against supervised learning, but I couldn't find a suitable algorithm that handles huge data with decent performance. Why is it important to show where the data comes from when I just want to know whether GridSearch is capable of unsupervised learning or not? – arooki Sep 29 '22 at 10:27
  • Success is measured by my f1_scorer and shown by a confusion matrix. If there's a good balance between precision and recall, I'm happy. – arooki Sep 29 '22 at 10:38
  • How can you measure f1 or precision or recall, or produce a confusion matrix, if you don't have an actual target? // You can use grid search for unsupervised learning, if you can provide a scoring metric; but so far you haven't provided one that will work with your data (as the error message attests). Just throwing a column into `y` is not a solution. – Ben Reiniger Sep 29 '22 at 12:20
  • I didn't know about the missing scoring function, thank you! The battery_soc is the most important value in my data because it should not leave a certain range. When I declare a fitting scoring function, how and where do I pass it? – arooki Sep 30 '22 at 08:01
  • @BenReiniger I don't know how to build a scoring metric for this problem, do you have any advice or info that can help me develop one? I'm sorry that this post may be confusing; I'm learning by myself and I'm obviously missing something. – arooki Oct 25 '22 at 11:17

2 Answers


You can use GridSearchCV for unsupervised learning, but it's often tricky to define a scoring metric that makes sense for the problem.

Here's an example in the docs that uses grid search for KernelDensity, an unsupervised estimator. It works without issue because this estimator has a score method (docs).
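For reference, here's a minimal sketch of that pattern (a toy example, roughly following the docs): grid-searching KernelDensity's bandwidth and relying on its built-in score method, so no scoring argument and no y are needed.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# toy 1-D data, just to illustrate the mechanics
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 1))

# no scoring= needed: GridSearchCV falls back to KernelDensity.score,
# the total log-likelihood of the held-out fold
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(X_demo)   # note: no y is passed at all
print(grid.best_params_)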

In your case, since IsolationForest doesn't have a score method, you'll need to define a custom scorer to pass as the search's scoring parameter. There's an answer at this question, and also this question, but I don't think the metrics given there necessarily make sense. Unfortunately, I don't have a useful outlier detection metric in mind; that's a question better suited for the data science or statistics stackexchange sites.
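To show just the mechanics of that (not a recommendation of a metric): a scorer for GridSearchCV can be any callable with the signature (estimator, X, y). The sketch below uses the estimator's decision_function and a placeholder statistic; note that the sign convention of decision_function differs between sklearn's IsolationForest and pyod's IForest, and the statistic is only there to mark where a real metric would go.

import numpy as np

def outlier_scorer(estimator, X, y=None):
    # Placeholder only: the mean anomaly score is not a meaningful
    # model-selection criterion; replace it with a metric that fits
    # your problem.
    scores = estimator.decision_function(X)
    return -np.mean(scores)

# Passed like any built-in scorer; y can be omitted once the scorer
# accepts y=None:
# grid = GridSearchCV(clf, param_grid, scoring=outlier_scorer, cv=5)
# grid.fit(X_train)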

Ben Reiniger

I agree with @Ben Reiniger's answer, and it has good links to other SO posts on this topic.
You can try creating a custom scorer by assuming you can make use of y_train. This is not strictly unsupervised.

Here is one example where the R2 score is used as the scoring metric.

from sklearn.metrics import r2_score

def scorer_f(estimator, X_train, Y_train):
    # GridSearchCV calls the scorer with the fitted estimator and the
    # current fold's X and y
    y_pred = estimator.predict(X_train)
    return r2_score(Y_train, y_pred)

Then you can use it as normal.

clf = IForest(random_state=47, behaviour='new',
              n_jobs=-1)

param_grid = {'n_estimators': [20,40,70,100], 
              'max_samples': [10,20,40,60], 
              'contamination': [0.1, 0.01, 0.001], 
              'max_features': [5,15,30], 
              'bootstrap': [True, False]}

grid_estimator = model_selection.GridSearchCV(clf, 
                                              param_grid,
                                              scoring=scorer_f,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)

grid_estimator.fit(X_train, y_train)
Gaurav Chawla