
I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I did not completely understand the example and explanation given in the scikit-learn documentation.

Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?

Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?

This is my code:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
print(len(y_pred_train))

I tried by loading my dataset to X_train but that does not seem to work.

  • Your code works for your toy example with minor corrections. If you have problems running `IsolationForest` on your dataset, show it to us along with all the preprocessing steps you've done and the error message you get – Sergey Bushmanov Feb 18 '19 at 07:19
  • Do you have ground truth labels for your "outliers"? – davidrpugh Feb 18 '19 at 07:20
  • 1
    @davidrpugh You do not need any "ground truth" for `IsolationForest`, the rationale behind it is different... – Sergey Bushmanov Feb 18 '19 at 07:23
  • @SergeyBushmanov I understand that ground truth labels are not needed in order to use `IsolationForest`; however, if the OP has such labels, they could be used to tune hyperparameters or to score `IsolationForest` on test data for comparison with other models. – davidrpugh Feb 18 '19 at 10:00

1 Answer


Do I need a separate dataset to train the model?

The short answer is "No": you train on and predict outliers for the same data.
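Concretely, here is a minimal sketch of that workflow on a dataset with the shape you describe (random data stands in for your actual 258-row, 10-column dataset, and `contamination=0.1` is an arbitrary choice you would tune):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(258, 10)  # stand-in for your 258-row, 10-column dataset

# fit_predict fits the forest and labels the very same data in one call
clf = IsolationForest(contamination=0.1, random_state=rng)
labels = clf.fit_predict(X)

print((labels == -1).sum())  # number of rows flagged as outliers
```

No separate, outlier-free training set is needed; the algorithm isolates anomalous rows precisely because they are mixed in with the rest.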

IsolationForest is an unsupervised learning algorithm intended to clean your data of outliers (see the docs for more). In a usual machine learning setting, you would run it to clean your training dataset. As far as your toy example is concerned:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# note: the behaviour="new" parameter from older scikit-learn releases
# was removed in 0.24; its behaviour is now the default
clf = IsolationForest(max_samples=100, random_state=rng, contamination=.1)

clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train
array([ 1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

where 1 represents an inlier and -1 represents an outlier. As specified by the `contamination` parameter, the fraction of samples flagged as outliers is 0.1.
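If you would rather rank samples than commit to a hard ±1 split, `decision_function` returns a continuous anomaly score (negative on the outlier side of the threshold). A short sketch on the same toy data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

clf = IsolationForest(max_samples=100, random_state=rng, contamination=.1)
clf.fit(X_train)

# lower score = more anomalous; predict() returns -1 exactly where score < 0
scores = clf.decision_function(X_train)
print(np.argsort(scores)[:5])  # indices of the five most anomalous rows
```

Inspecting scores this way can help you decide whether a given `contamination` value is reasonable for your dataset.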

Finally, you would remove the outliers like this:

X_train_cleaned = X_train[y_pred_train == 1]
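Putting the whole pipeline together (a sketch on the same toy data; the boolean mask alone suffices, so `np.where(..., True, False)` is redundant):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

clf = IsolationForest(max_samples=100, random_state=rng, contamination=.1)
y_pred_train = clf.fit_predict(X_train)  # fit + predict in one step

# keep only the rows labelled as inliers (roughly 90% of the 200 rows)
X_train_cleaned = X_train[y_pred_train == 1]
print(X_train_cleaned.shape)
```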
Sergey Bushmanov