
My question is about the novelty detection algorithms Isolation Forest and One-Class SVM. I have a training dataset (with 4-5 features) in which all sample points are inliers, and I need to classify any new data as an inlier or an outlier and write it to another dataframe accordingly.

While trying to use Isolation Forest or One-Class SVM, I have to supply the expected outlier fraction (contamination for Isolation Forest, nu for One-Class SVM) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and use that outlier fraction as nu?

Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to fix this other than switching to the Extended Isolation Forest algorithm?

Thanks in advance.

2 Answers


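Regarding contamination for Isolation Forest:

If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (as of scikit-learn 0.20). Note that later scikit-learn releases default to contamination='auto' and require a float value to lie in (0, 0.5], so contamination=0 is only accepted by older versions.

The following simple code shows this: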

1- Import libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)

2- Generate a 2D dataset

X = 0.3 * rng.randn(1000, 2)

3- Train iForest model and predict the outliers

clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)  

4- Print # of anomalies

print(sum(y_pred_train==-1))

This gives 0 anomalies. If you change the contamination to 0.15, the program flags 150 anomalies (15% of the 1000 points) in the same dataset (the same because of RandomState(42)).
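To check how a model fitted on inliers only treats unseen points, a minimal sketch along the same lines follows. The new points `X_new` and the value `contamination=0.01` are assumptions made purely for illustration; a small positive value is used here because recent scikit-learn releases do not accept 0.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Training data: inliers only, generated as in the example above
X_train = 0.3 * rng.randn(1000, 2)

# Hypothetical new data: 20 points from the same distribution,
# followed by 20 points placed far away from the training cloud
X_new_in = 0.3 * rng.randn(20, 2)
X_new_out = rng.uniform(low=4, high=6, size=(20, 2))
X_new = np.vstack([X_new_in, X_new_out])

# Small positive contamination (an assumed value): the threshold then
# sits at the 1st percentile of the training scores
clf = IsolationForest(random_state=0, contamination=0.01)
clf.fit(X_train)

y_pred_new = clf.predict(X_new)  # +1 = inlier, -1 = outlier
n_near_flagged = int(np.sum(y_pred_new[:20] == -1))
n_far_flagged = int(np.sum(y_pred_new[20:] == -1))
print("flagged among nearby points:", n_near_flagged)
print("flagged among far points:", n_far_flagged)
```

The contamination value here only positions the cut-off on the training scores, so it is best treated as a tolerated false-alarm rate on the inlier data rather than as a statement that the training set contains outliers.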

References:

[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." Eighth IEEE International Conference on Data Mining (ICDM'08), 2008.

[2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.

  • 1
    Thanks for replying. I understand that with contamination=0 the model is trained so that no anomalies are detected in the training dataset, which matches the ground truth. However, my objective is that when I use this model on a new dataset, it should identify the outliers and inliers in that dataset relative to my training data. Since the model has been trained with contamination=0, it concludes that every sample in the new dataset is also an inlier, and the anomaly detection fails. How can I resolve this issue? – subhadeep sarkar Oct 17 '19 at 23:43
2

"Training with normal data(inliers) only".

This is against the nature of Isolation Forest. Training here is completely different from training in neural networks. Because many people use these algorithms without understanding what is going on, and write blog posts with 20% of the ML knowledge required, we end up with questions like this.

clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)

What does fit do here? Is it training? If yes, what is trained?

In Isolation Forest:

  1. First, we build trees,
  2. Then, we pass each data point through each tree,
  3. Then, we calculate the average path length required to isolate the point.
  4. The shorter the path, the higher the anomaly score.
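The steps above can be observed with scikit-learn's implementation. This is only a rough sketch: the sample data and the two probe points `central` and `isolated` are hypothetical, chosen so that one point sits inside the cloud and the other far outside it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = 0.3 * rng.randn(500, 2)  # hypothetical sample data

# Steps 1-3: fit() builds the trees; score_samples() passes each point
# through every tree and turns the average path length into a score.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

central = np.array([[0.0, 0.0]])   # inside the cloud: long paths
isolated = np.array([[3.0, 3.0]])  # far from the cloud: short paths

# Step 4: scikit-learn returns the negated anomaly score of the paper,
# so a LOWER value means a MORE anomalous point.
score_central = forest.score_samples(central)[0]
score_isolated = forest.score_samples(isolated)[0]
print("central:", score_central)
print("isolated:", score_isolated)
```

Nothing in this scoring step uses labels or a contamination value; it only ranks points by how easily they are isolated.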

contamination determines your threshold. If it is 0, then what is your threshold?
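As a sketch of what "threshold" means in scikit-learn's implementation: with a float contamination, the fitted `offset_` attribute corresponds to that percentile of the training scores. The value 0.1 and the data below are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)

contamination = 0.1  # assumed value, for illustration only
clf = IsolationForest(random_state=0, contamination=contamination).fit(X)

scores = clf.score_samples(X)  # lower = more anomalous

# The threshold is the contamination-th percentile of the training scores
threshold = np.percentile(scores, 100 * contamination)
print("offset_:", clf.offset_, "percentile:", threshold)

# predict() labels a point -1 when its score falls below the threshold
n_flagged = int(np.sum(clf.predict(X) == -1))
print("flagged training points:", n_flagged)
```

Because the cut-off is a percentile of the training scores, roughly that fraction of the training set is flagged no matter how clean the data actually is.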

Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.

Mr. Panda