I have a training dataset which contains no outliers:
train_vectors.shape
(588649, 896)
I also have a set of test vectors (test_vectors), all of which are outliers.
Here is my attempt at doing the outlier detection:
from sklearn.ensemble import IsolationForest
import numpy as np

clf = IsolationForest(max_samples=0.01)
clf.fit(train_vectors)

y_pred_train = clf.predict(train_vectors)  # 1 = inlier, -1 = outlier
print(len(y_pred_train))
print(np.count_nonzero(y_pred_train == 1))
print(np.count_nonzero(y_pred_train == -1))
Output:
588649
529771
58878
So the flagged outlier percentage is around 10%, which is the default value of the contamination parameter for IsolationForest in sklearn. Please note that there are no outliers at all in the training set.
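Since the training data is clean, I am guessing I should pass a much smaller contamination value than the default. Is something like this untested sketch (with contamination=0.001 picked arbitrarily, not derived from the data) the right direction?

from sklearn.ensemble import IsolationForest

# Sketch only: training data is assumed outlier-free, so use a
# contamination value far below the ~10% default.
# contamination must lie in (0, 0.5].
clf = IsolationForest(
    max_samples=0.01,
    contamination=0.001,  # arbitrary small value, not tuned
    random_state=42,
)
clf.fit(train_vectors)

(The test results below are still from my original code above.)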
Testing code and results:
y_pred_test = clf.predict(test_vectors)
print(len(y_pred_test))
print(np.count_nonzero(y_pred_test == 1))
print(np.count_nonzero(y_pred_test == -1))
Output:
100
83
17
So it detects only 17 anomalies out of 100. Can someone please tell me how to improve the performance? I am also not sure why the algorithm requires the user to specify the contamination parameter. It is clear to me that it is used as a threshold, but how am I supposed to know the contamination level beforehand?
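One idea I had is to sidestep contamination entirely and set my own threshold on the raw anomaly scores, using the clean training data to pick the cut-off. Something like this is what I have in mind (untested; it assumes a scikit-learn version that provides score_samples, and the 0.1th-percentile cut-off is an arbitrary placeholder):

import numpy as np
from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=0.01, random_state=42)
clf.fit(train_vectors)

# score_samples: higher scores mean "more normal", lower means "more anomalous"
train_scores = clf.score_samples(train_vectors)
test_scores = clf.score_samples(test_vectors)

# Take a low percentile of the (clean) training scores as the cut-off;
# the 0.1th percentile here is just a placeholder, not a tuned value.
threshold = np.percentile(train_scores, 0.1)
y_pred_test = np.where(test_scores < threshold, -1, 1)
print(np.count_nonzero(y_pred_test == -1))

Is that a reasonable way to use the model when the training data is known to be clean? Thank you!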