
I am using the code below for anomaly detection. It is a binary classification problem, so the confusion matrix should be 2x2, but instead it is 3x3, with extra zeros appended in a T shape. A similar thing happened with OneClassSVM a few weeks ago, but at the time I thought I was doing something wrong. Could you please help me fix this?

import numpy as np
import pandas as pd
import os
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report 
from sklearn import metrics
from sklearn.metrics import roc_auc_score

data = pd.read_csv('opensky_train.csv')

# to make sure that the normal data contains no anomalies
sortedData = data.sort_values(by=['class'])
target = pd.DataFrame(sortedData['class'])

Y = target.replace(['surveill', 'other'], [1,0])
X = sortedData.drop(['class'], axis = 1)

x_normal = X.iloc[:200,:]
y_normal = Y.iloc[:200,:]
x_anomaly = X.iloc[200:,:]
y_anomaly = Y.iloc[200:,:]

Edited:

column_values = y_anomaly.values.ravel()
unique_values =  pd.unique(column_values)
print(unique_values)

Output : [0 1]

clf = IsolationForest(random_state=0).fit(x_normal)
pred = clf.predict(x_anomaly)

print(pred)

Output : [ 1 1 1 1 1 1 -1 1 -1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 -1 1 1 -1 1 1 -1 1 1 -1 1 -1 1 -1 1 1 -1 -1 1 -1 -1 1 1 1 1 -1 1 1 -1 -1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 -1]

#printing the results 
print(confusion_matrix(y_anomaly, pred))
print(classification_report(y_anomaly, pred))

Result:

Confusion Matrix :
[[ 0  0  0]
 [ 7  0 60]
 [12  0 28]]
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00         0
           0       0.00      0.00      0.00        67
           1       0.32      0.70      0.44        40

    accuracy                           0.26       107
   macro avg       0.11      0.23      0.15       107
weighted avg       0.12      0.26      0.16       107
  • We don't have your data, so we cannot be sure if this is indeed a binary set; please update your question to **show with code** the unique values of your `y_anomaly` and `pred`. See [this thread](https://stackoverflow.com/questions/12897374/get-unique-values-from-a-list-in-python) if you need help with this task. From the confusion matrix, it seems that there are three unique values - `-1, 0, 1`. – desertnaut Apr 12 '20 at 00:54
  • Please show the `y_anomaly` and `pred` in your code, so we can help you. – Henrique Branco Apr 12 '20 at 00:57
  • Done! Please have a look and let me know if you need any other information. Thanks – Vers Apr 12 '20 at 22:43

1 Answer


Inliers are labeled 1, while outliers are labeled -1

Source: scikit-learn, Novelty and Outlier Detection.

Your example transformed the classes to 0 and 1, while `predict` returns -1 and 1. `confusion_matrix` builds its rows and columns from the union of the labels seen in both arrays, so the three possible labels are -1, 0 and 1, which is where the 3x3 matrix comes from.
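
To see where the extra row and column come from, here is a minimal sketch with made-up labels (not your data):

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: ground truth encoded as 0/1, predictions as -1/1,
# exactly the mismatch in the question.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([-1, 1, 1, -1])

# confusion_matrix builds its rows/columns from the sorted union of the
# labels seen in y_true and y_pred, i.e. [-1, 0, 1] -> a 3x3 matrix,
# with an all-zero row for the label -1 that never occurs in y_true.
print(confusion_matrix(y_true, y_pred))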

You need to change from

Y = target.replace(['surveill', 'other'], [1,0])

to

Y = target.replace(['surveill', 'other'], [1,-1])
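
For completeness, here is a sketch of how the full pipeline from the question would look with this change (it assumes the same `opensky_train.csv` and column names as above, which I cannot run here):

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, classification_report

data = pd.read_csv('opensky_train.csv')
sortedData = data.sort_values(by=['class'])
target = pd.DataFrame(sortedData['class'])

# Map the classes to IsolationForest's own convention
# (inliers -> 1, outliers -> -1), as in the answer above.
Y = target.replace(['surveill', 'other'], [1, -1])
X = sortedData.drop(['class'], axis=1)

x_normal, y_normal = X.iloc[:200, :], Y.iloc[:200, :]
x_anomaly, y_anomaly = X.iloc[200:, :], Y.iloc[200:, :]

clf = IsolationForest(random_state=0).fit(x_normal)
pred = clf.predict(x_anomaly)

# y_anomaly and pred now both contain only -1 and 1, so the matrix is 2x2.
print(confusion_matrix(y_anomaly, pred))
print(classification_report(y_anomaly, pred))

If you would rather keep the 0/1 encoding in the dataframe, you can also remap only at evaluation time, e.g. `confusion_matrix(y_anomaly.replace([0, 1], [-1, 1]), pred)`.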
Jon Nordby
  • I'll update the question with the unique target and predicted values. Please have a look and help me figure out what those three values can be, because my problem is binary classification, so there shouldn't be a third value at all. – Vers Apr 12 '20 at 15:54
  • @Vers commenting to say that you *will* update is not very useful; please leave a comment when you have done so. – desertnaut Apr 12 '20 at 16:09
  • @desertnaut Hi! Sorry about that, I had classes so I couldn't update at the time. I'll keep that in mind. – Vers Apr 12 '20 at 22:44
  • @Vers, I have added the solution a bit more explicitly in my answer now – Jon Nordby Apr 13 '20 at 11:29
  • Oh I see. So every time we use an outlier detection algorithm, we need to change the targets from 0/1 to -1/1? Is there any automatic way for it to pick this up correctly? Because sometimes the dataset already has 0s and 1s in it. – Vers Apr 13 '20 at 17:36
  • If you have 0/1, then you replace that with -1/1 in exactly the same way (see the sketch below). – Jon Nordby Apr 13 '20 at 17:58
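
A quick sketch of that remapping, using a hypothetical 0/1 target column rather than the question's data:

import pandas as pd

# Hypothetical target column that is already encoded as 0/1.
Y = pd.DataFrame({'class': [0, 1, 1, 0, 1]})

# Remap 0 -> -1 so the ground truth uses the same -1/1 convention that
# IsolationForest.predict() and OneClassSVM.predict() return.
Y = Y.replace([0, 1], [-1, 1])
print(Y['class'].unique())  # [-1  1]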