
I need a machine learning algorithm that will satisfy the following requirements:

  • The training data are a set of feature vectors, all belonging to the same, "positive" class (as I cannot produce negative data samples).
  • The test data are some feature vectors which might or might not belong to the positive class.
  • The prediction should be a continuous value indicating the "distance" from the positive samples (i.e. 0 means the test sample clearly belongs to the positive class, 1 means it is clearly negative, and 0.3 means it is somewhat positive).

An example: Let's say that the feature vectors are 2D feature vectors.

Positive training data:

  • (0, 1), (0, 2), (0, 3)

Test data:

  • (0, 10) should be an anomaly, but not a pronounced one
  • (1, 0) should be an anomaly, but with higher "rank" than (0, 10)
  • (1, 10) should be an anomaly, with an even higher anomaly "rank"
  • The idea is to examine the "distance" from the positive examples (like in anomaly detection). I'm actually looking for an anomaly detection algorithm that reports a percentage (i.e. the scale of the anomaly). – ido4848 Jun 12 '16 at 15:44
  • Can you be more specific, e.g. what is your data about? Can you provide some sample input data and what you are expecting as a result? – miraculixx Jun 12 '16 at 15:45
  • @miraculixx i have added an example – ido4848 Jun 12 '16 at 16:17

1 Answer


The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:

  • K-nearest neighbors - a simple distance-based method which assumes that normal data samples are close to other normal data samples, while novel samples are located far from the normal points. Python implementation of KNN can be found in ScikitLearn.
  • Mixture models (e.g. Gaussian Mixture Model) - probabilistic models modeling the generative probability density function of the data, for instance using a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find parameters of a probability distribution so that it describes the samples best. Then, use the probability of a new sample to decide if it belongs to the distribution or is an outlier. ScikitLearn implements Gaussian Mixture Models and uses the Expectation Maximization algorithm to learn them.
  • One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from the unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, projected to the so-called "feature space"). ScikitLearn has an implementation of one-class SVM which allows you to use it easily, and a nice example. I attach the plot of that example to illustrate the boundary one-class SVM finds "around" the normal data samples: [plot from the scikit-learn one-class SVM example: the learned decision boundary enclosing the normal training points]. A minimal code sketch applying all three techniques to the toy example from the question is shown below.
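
As a quick illustration of all three approaches, here is a minimal sketch applied to the toy 2-D example from the question, assuming scikit-learn is available; the specific hyperparameters (n_neighbors, n_components, gamma, nu) are arbitrary illustrative choices, not recommendations from the original answer:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import OneClassSVM

    X_train = np.array([[0, 1], [0, 2], [0, 3]])   # positive samples only
    X_test = np.array([[0, 10], [1, 0], [1, 10]])  # candidate anomalies

    # 1) k-nearest neighbors: use the distance to the nearest training
    #    sample as the anomaly score (larger distance = more anomalous).
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    knn_dist, _ = nn.kneighbors(X_test)
    print("kNN distance:       ", knn_dist.ravel())

    # 2) Gaussian mixture model: fit a density to the positive samples and
    #    use the negative log-likelihood of a test point as its score.
    gmm = GaussianMixture(n_components=1).fit(X_train)
    gmm_score = -gmm.score_samples(X_test)
    print("GMM -log p(x):      ", gmm_score)

    # 3) One-class SVM: decision_function is positive inside the learned
    #    boundary and negative outside; negate it to get an anomaly score.
    ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(X_train)
    svm_score = -ocsvm.decision_function(X_test)
    print("One-class SVM score:", svm_score)

Each of these produces a raw, unbounded score, so to get the 0-to-1 "percentage" asked for in the question you would squash it afterwards, e.g. with min-max scaling over a held-out set or a transform like 1 - exp(-score).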
  • Regarding mixture models, when you say "use the probability of a new sample to decide if it belongs to the distribution or is an outlier", what probability is that exactly? For example, sci-kit GMM's predict_proba method (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture) returns a vector of probabilities that sum to 1. I was hoping a novelty would return a vector with very low probabilities for all components, thus not necessarily summing to 1. – felipeduque Jan 05 '18 at 16:50
  • In 2022, a comprehensive survey covering both traditional and deep learning methods, which I found very informative, is this one: https://arxiv.org/abs/1901.03407v2 . By the way, I am not getting any benefit from promoting it; I just came across this post and thought it would be nice to update it :) – cestpasmoi Jan 20 '22 at 07:48