
I have a training dataset which contains no outliers:

train_vectors.shape
(588649, 896)

I also have a set of test vectors (test_vectors), all of which are outliers.

Here is my attempt at doing the outlier detection:

import numpy as np
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=0.01)
clf.fit(train_vectors)
y_pred_train = clf.predict(train_vectors)
print(len(y_pred_train))
print(np.count_nonzero(y_pred_train == 1))
print(np.count_nonzero(y_pred_train == -1))

Output:
 588649
 529771
 58878

So, here the outlier percentage is around 10%, which is the default contamination parameter used for Isolation Forests in sklearn. Please note that there aren't any outliers in the training set.

Testing code and results:

y_pred_test = clf.predict(test_vectors)
print(len(y_pred_test))
print(np.count_nonzero(y_pred_test == 1))
print(np.count_nonzero(y_pred_test == -1))

Output:
 100
 83
 17

So, it detects only 17 anomalies out of the 100. Can someone please tell me how to improve the performance? I am not at all sure why the algorithm requires the user to specify the contamination parameter. It is clear to me that it is used as a threshold, but how am I supposed to know the contamination level beforehand? Thank you!
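For what it's worth, here is a minimal sketch of my understanding of how contamination acts as a threshold, using small random data rather than my actual vectors (assuming a recent sklearn):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))  # synthetic stand-in for train_vectors

clf = IsolationForest(contamination=0.05, random_state=0).fit(X)
scores = clf.decision_function(X)  # shifted anomaly scores; negative = outlier
pred = clf.predict(X)

# predict() is just a zero-threshold on decision_function ...
assert np.array_equal(pred, np.where(scores < 0, -1, 1))
# ... and contamination places that threshold so that roughly this
# fraction of the *training* data scores below it
print((pred == -1).mean())  # roughly 0.05
```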

user1274878

2 Answers


IsolationForest works a bit differently from what you described :). The contamination is:

The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. link

This means that your train set should contain about 10% outliers. Ideally, your test set should contain about the same proportion of outliers as well - and it should not consist of outliers only.

train set and test set proportions
------------------------------------------------
|  normal ~ 90%                  | outliers 10%|
------------------------------------------------

Try to change your dataset proportions as described and try again with the code you posted!
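For example, a quick synthetic sketch of the proportions above (the outlier cluster here is deliberately placed far away so it is easy to separate):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(size=(900, 4))                    # ~90% normal points
outliers = rng.uniform(low=4, high=8, size=(100, 4))  # ~10% clear outliers
X_train = np.vstack([normal, outliers])

clf = IsolationForest(contamination=0.1, random_state=0).fit(X_train)
print((clf.predict(outliers) == -1).mean())  # most of the outliers are caught
```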

Hope this helps, good luck!

P.S. You can also try OneClassSVM which is trained with the normal instances only - the test set should also be pretty much like above and not only outliers though.
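A rough sketch of that OneClassSVM idea with synthetic data; note that nu (an upper bound on the fraction of training errors) plays a role loosely similar to contamination:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 4))          # "normal" instances only
X_test = rng.normal(loc=5.0, size=(50, 4))   # clearly shifted outliers

ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
print((ocsvm.predict(X_test) == -1).mean())  # most test points flagged as -1
```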

mkaran
  • I see. But, how am I to know the outlier percentage beforehand? – user1274878 Jul 12 '17 at 10:32
  • @user1274878 I know I know... you can't really know, but, either you have an estimate, e.g. your outliers are rare because of some assumption, or you have various datasets and know more or less what to expect. In either case, run experiments to evaluate and tune your parameters [more info](https://stackoverflow.com/a/43271326/3433323) Btw, Isolation Forest works on the assumption that your outliers are few and can be easily separated ("few and different"). – mkaran Jul 12 '17 at 11:37
  • I tried OneClassSVM as per your suggestion, and it is taking hours for this dataset. It utilizes only a single core and about 90% of the memory. Can you please point me to an efficient implementation? – user1274878 Jul 15 '17 at 13:49
  • @user1274878 OCSVM is indeed very slow, after doing some research, I have tried different `nu` values and `shrinking` set to `False` - not an impressive improvement though. You can also change the `max_iter`. – mkaran Jul 17 '17 at 09:24
  • @mkaran Thanks for your answer. May I ask you kindly to have a look at related post [here](https://stackoverflow.com/questions/66643736/incorrect-results-of-isolationforest)? – Mario Mar 16 '21 at 12:27

Although this question is a couple of years old, I'm posting this for future reference and for people asking similar questions, as I'm currently in a similar situation.

In the Scikit Learn Documentation it states:

Outlier detection: The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.

Novelty detection: The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.

Judging from this part of the question, "(..)here the outlier percentage is around 10% which is the default contamination parameter used for Isolation Forests in sklearn. Please note that there aren't any outliers in the training set.", it seems that what you may want to use is actually Novelty Detection instead.

As @mkaran suggested, OneClassSVM can be used for Novelty Detection; however, since it's somewhat slow, I would suggest anyone in this situation try LocalOutlierFactor instead. Also, as of sklearn version 0.22, the default contamination for the IsolationForest algorithm is 'auto', so you no longer need to choose a value yourself, which may be very useful.
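A minimal sketch of that novelty-detection route with LocalOutlierFactor and synthetic data (novelty=True requires sklearn >= 0.20; with it, you fit on clean data and call predict on new points):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 4))          # outlier-free training data
X_test = rng.normal(loc=5.0, size=(50, 4))   # all novelties

lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print((lof.predict(X_test) == -1).mean())    # most novelties detected
```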

tegraze