Good Anomaly Detection Model for a Complicated Data

Question

I am working on a dataset and want to build an anomaly detection model for it. The data contains only three features: latitude, longitude and speed. I normalized it, applied t-SNE, and then normalized again. There are no labels or target values, so this has to be unsupervised anomaly detection.

I cannot share the data since it is private, but it looks like this:

There are some abnormal values in the data, such as these:

Here's the final shape of the data:

As you can see, the data is a bit complicated. When I searched for abnormal instances manually (by looking at the feature values), I found that the instances inside the red circle in the image below should be detected as anomalies.

The instances inside the red region should be abnormal:

I used OneClassSVM to detect anomalies. Here are the parameters:

nu = 0.02
kernel = "rbf"
gamma = 0.1
degree = 3
verbose = False
random_state = rng

And the model:

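from sklearn import svm

# fit the model; degree only applies to the poly kernel, and random_state is
# omitted because OneClassSVM ignored it and scikit-learn 0.22+ rejects it
clf = svm.OneClassSVM(nu=nu, kernel=kernel, gamma=gamma, verbose=verbose)
clf.fit(data_scaled)
y_pred_train = clf.predict(data_scaled)  # +1 = inlier, -1 = outlier
n_error_train = y_pred_train[y_pred_train == -1].size  # number of flagged points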

Here is what I obtained at the end:

Here are the anomalies detected by OneClassSVM; the red instances were flagged as anomalies:

So, as you can see, the model predicted many instances as anomalies, but in reality, most of these instances should be normal.

I tried different parameter values for nu, gamma and degree. However, I could not find a decision boundary that detects only the real anomalies.

What is wrong with my model? Should I try a different anomaly detection algorithm?
Is my data not suitable for anomaly detection?

I cannot share the data since it is private. But, I added a small portion of it. Thanks. — M.Arıcı, Apr 18 '18 at 08:57

Bert Kellerman · Accepted Answer · 2018-04-18T12:52:01.877

It appears that some of the points flagged by One-class SVM are global anomalies but not local ones. You might want to try Local Outlier Factor (LOF).

LOF takes the local structure of your data into account, so the points on the left side that were originally flagged, which belong to small clusters, should no longer look as anomalous.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

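from sklearn.neighbors import LocalOutlierFactor

# fit the model and label the training data in one step (-1 = outlier)
clf = LocalOutlierFactor()
y_pred_train = clf.fit_predict(data_scaled)
n_error_train = y_pred_train[y_pred_train == -1].size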
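To make the global-versus-local point concrete, here is a minimal sketch on synthetic data (the asker's data is private, so nothing below comes from it). The cluster positions, the cluster sizes and `n_neighbors=20` are assumptions chosen purely for illustration. It builds one large cluster, one small but dense cluster, and five isolated points, then ranks every point by LOF's `negative_outlier_factor_`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Synthetic stand-in for three features (assumed layout, not the real data)
main_cluster = rng.normal(loc=0.0, scale=1.0, size=(300, 3))
small_cluster = rng.normal(loc=8.0, scale=0.3, size=(30, 3))  # small but dense
isolated = np.array([
    [20.0, -20.0, 20.0],
    [-20.0, 20.0, -20.0],
    [20.0, 20.0, -20.0],
    [-20.0, -20.0, 20.0],
    [-20.0, -20.0, -20.0],
])
X = np.vstack([main_cluster, small_cluster, isolated])
isolated_idx = np.arange(len(X) - len(isolated), len(X))

# n_neighbors is kept below the small cluster's size, so each of its points
# finds all of its neighbors inside the cluster and looks locally normal
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # +1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_   # lower = more anomalous

# Indices of the points with the lowest (most anomalous) scores
most_anomalous = np.argsort(scores)[:len(isolated)]
print("most anomalous indices:", sorted(most_anomalous.tolist()))
print("number of flagged points:", int((labels == -1).sum()))
```

Ranking by score, as done here, avoids committing to a threshold up front. If a fixed fraction of anomalies is expected, the `contamination` parameter can be set explicitly instead.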

I would also try Isolation Forest and tweak its contamination ratio. You don't have to scale your data for Isolation Forest, and I suspect you might not want to here.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.predict

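from sklearn.ensemble import IsolationForest

# fit the model on the unscaled data; contamination is the expected outlier fraction
clf = IsolationForest(contamination=0.01)
clf.fit(data)
y_pred_train = clf.predict(data)  # +1 = inlier, -1 = outlier
n_error_train = y_pred_train[y_pred_train == -1].size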
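As a rough sketch of what tweaking the contamination ratio looks like in practice, the loop below refits Isolation Forest with several contamination values and counts how many points get flagged each time. The data is synthetic and unscaled; the latitude, longitude and speed ranges, and the contamination values tried, are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic, unscaled stand-in for the three features (assumed ranges)
X = np.column_stack([
    rng.normal(41.0, 0.05, size=500),  # latitude-like
    rng.normal(29.0, 0.05, size=500),  # longitude-like
    rng.normal(60.0, 10.0, size=500),  # speed-like
])

flagged_counts = {}
for contamination in (0.005, 0.01, 0.02, 0.05):
    # A fixed random_state rebuilds the same trees on every pass,
    # so only the decision threshold changes with contamination
    clf = IsolationForest(contamination=contamination, random_state=0)
    y_pred = clf.fit_predict(X)  # +1 = inlier, -1 = outlier
    flagged_counts[contamination] = int((y_pred == -1).sum())

for contamination, count in flagged_counts.items():
    print(f"contamination={contamination}: {count} points flagged")
```

Plotting the flagged points for each setting against the manually identified anomalies is one way to pick a value, since there are no labels to tune against.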

Local Outlier Factor worked well. The important point is that I had previously been using the output of t-SNE as the input to LOF, and that did not work. When I used all the features (no dimensionality reduction), LOF successfully differentiated anomalous samples from normal samples. Thanks. — M.Arıcı, Apr 23 '18 at 20:27
@Bert Kellerman Thanks for your answer. May I kindly ask you to have a look at a related post [here](https://stackoverflow.com/questions/66643736/incorrect-results-of-isolationforest)? — Mario, Mar 16 '21 at 11:08
