
I train a OneClassSVM for anomaly detection, using GridSearchCV for hyperparameter tuning.

What I do is 1-fold cross-validation: for each HP configuration, the model is trained on my class of interest only and validated on a mix of my class of interest and the other classes. I set "refit=False" in GridSearchCV as I don't want it to retrain on everything at the end (all the observations of my class of interest plus the rest).

The HP tuning gives me a best metric.

After that, for the sake of verification, I train a OneClassSVM without GridSearchCV, with a simple model.fit() on the train set that was passed to GridSearchCV, and I evaluate it on the same validation set. This gives me a slightly different metric.

So my question is: is there some randomness in OneClassSVM? I saw in older versions of the scikit-learn documentation that this model had a "random_state" parameter, which is not available anymore. I thought that this parameter, coupled with "max_iter=-1", could maybe be the cause of this non-repeatability.

I triple-checked my code for fold creation, etc., so I don't think the mistake is on that side.

Below is an example of my code:

# Imports
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Instantiation of a PCA
pca = PCA()

# Instantiation of the StandardScaler
scaler = StandardScaler()

# Numerical variables
numeric_features = X.select_dtypes([np.number]).columns

# Instantiation of the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("scaling", scaler, numeric_features)
    ]
)

# Creation of the pipeline
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("pca", pca),
    ("estimator", OneClassSVM())
])

# Definition of the models and hyperparameters configurations to try
parameters = [
    {
        "pca__n_components": [3, 5, 7],
        "estimator__kernel": ["linear", "poly", "sigmoid"],
        "estimator__degree": [2, 3, 4, 5],
        "estimator__gamma": ["scale", "auto"],
        "estimator__nu": [0.01, 0.05, 0.1],  
        "estimator__max_iter": [-1]
    }
]   
                       
# Hyper-parameters optimization
grid_search = GridSearchCV(
    pipeline,
    parameters,
    cv=folds_indices,
    scoring="f1_weighted",
    n_jobs=1,
    refit=False,
    return_train_score=False,
    verbose=3
)
grid_search.fit(X, y)
print("\nBest score is:")
print(f"{grid_search.best_score_:.4f}")
print("\n")
print("Obtained with hyperparameters:")
print(grid_search.best_params_)
print("\n")


# Instantiation of a OneClassSVM model with the best parameters found
model = OneClassSVM(
                    degree=grid_search.best_params_["estimator__degree"],
                    gamma=grid_search.best_params_["estimator__gamma"],  
                    kernel=grid_search.best_params_["estimator__kernel"], 
                    nu=grid_search.best_params_["estimator__nu"],
                    max_iter=grid_search.best_params_["estimator__max_iter"]
                    )

Here X is my feature matrix and y the observation labels. X contains both the train and validation observations, which are indexed inside GridSearchCV via "cv=folds_indices", a tuple of (train indices, validation indices). I don't set "cv" to an integer because this is a one-class model: doing so would train my model on the validation set as well, which contains a mix of classes.
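For illustration, the tuple is built along these lines (a simplified sketch; the inlier label 1 and the 70/30 proportion are placeholders, not my exact values):

import numpy as np

rng = np.random.default_rng(0)

# Positional indices of the class of interest (placeholder label: 1) and of the other classes
inlier_idx = rng.permutation(np.flatnonzero(y == 1))
outlier_idx = np.flatnonzero(y != 1)

# Train on a subset of the class of interest only (placeholder 70/30 split)
n_train = int(0.7 * len(inlier_idx))
train_idx = inlier_idx[:n_train]

# Validate on the remaining class-of-interest observations plus all other classes
val_idx = np.concatenate([inlier_idx[n_train:], outlier_idx])

# GridSearchCV accepts an iterable of (train, test) index arrays as cv
folds_indices = ((train_idx, val_idx),)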

I also set "refit" to False because I don't want to train on all the train+validation data at the end.

In the end, I create a model from scratch with the best HP configuration found by GridSearchCV. I then train this model on the same data that were used for training in the grid search and evaluate it with ".predict()" on the validation set that was used for validation in the grid search. Doing so gives me different results at each run: I ran fit() + predict() several times and I get slightly different results each time, the HP and data sets being the same.
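Concretely, that verification step looks roughly like this (a sketch: I assume the same scaling and PCA are applied before the OneClassSVM, so I push the best parameters back into the pipeline here, and I assume y is encoded as +1/-1 so it matches the output of predict()):

from sklearn.metrics import f1_score

# Refit manually on the same train fold used inside the grid search
pipeline.set_params(**grid_search.best_params_)
pipeline.fit(X.iloc[train_idx])

# Evaluate on the same validation fold (predict() returns +1 for inliers, -1 for outliers)
y_pred = pipeline.predict(X.iloc[val_idx])
print(f1_score(y.iloc[val_idx], y_pred, average="weighted"))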

When checking "model.n_iter_", I see that this number changes. I thought that maybe OneClassSVM shuffles data and/or processes them by batch iteratively, thus causing different conditions each time. Fixing "max_iter" to a defined number doesn't fix my problem (metrics still change).
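The variability can be seen with something like this (a quick sketch, reusing the pipeline and the indices from the sketches above):

# Fit the exact same configuration twice on the same data and compare
for run in range(2):
    pipeline.fit(X.iloc[train_idx])
    n_iter = pipeline.named_steps["estimator"].n_iter_
    preds = pipeline.predict(X.iloc[val_idx])
    print(run, n_iter, np.mean(preds == 1))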

Many thanks in advance

Cheers

Antoine

  • Could you add a minimal reproducible example? – nscholten Jul 28 '23 at 12:55
  • And what do you mean by 1-fold cross-validation? Are you referring to leave-one-out cross-validation? You cannot use cv=1 with GridSearchCV. – nscholten Jul 28 '23 at 13:11
  • To me it sounds like you have two classifiers - one that has been tuned using grid search, and one that hasn't been tuned. They will give different results as they'll be configured with different hyperparameters. The tuned one should yield a better score. Not sure if I have understood the question as you intended. – some3128 Jul 28 '23 at 14:23
  • Hi guys, thanks for replying. I edited my post, adding my block of code plus some additional context. Hope this is enough. – Antoine101 Jul 28 '23 at 16:20
  • No problem. I've posted an answer below that I think addresses the differing results. – some3128 Jul 29 '23 at 21:25

1 Answer


Depending on the size of your dataset, PCA() with the default svd_solver="auto" may switch to a randomised SVD algorithm. If that happens, you'll see some variability in the PCA results each time you fit it. To eliminate this variability, set the random_state= parameter of PCA() to a constant. Alternatively, you can force PCA() to stick with a deterministic algorithm by setting its svd_solver= parameter (for example svd_solver="full").
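For example, either of the following keeps the PCA step deterministic (a minimal sketch; pca is the same variable used in the question's pipeline):

from sklearn.decomposition import PCA

# Option 1: keep the automatic solver selection, but make the randomised path repeatable
pca = PCA(random_state=0)

# Option 2: force a deterministic solver so no randomness is involved at all
pca = PCA(svd_solver="full")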

  • Just tried setting the random_state and I get repeatable results now. Thanks a lot! I was not looking at the right thing. Still, OneClassSVM has an "n_iter" parameter and previously had a "random_state" parameter as well (in previous versions of SkLearn). I am wondering how it works internally. Cheers anyway. – Antoine101 Jul 31 '23 at 15:51
  • It seems as though previous versions would shuffle the data, and you could make the shuffling repeatable by setting `random_state`. The `max_iter` is because the algorithm gradually takes steps (iterations) towards a solution, rather than being able to find a solution in one go. Two use-cases for `max_iter` are: [1] stopping the solver if it's taking too long. [2] Preventing the solver from overfitting by not giving it the opportunity to perfectly fit your data. This makes the solution less accurate for your train data, but that ambiguity may also result in a more robust and general solution. – some3128 Jul 31 '23 at 16:33