
I have 5000 data points for each of my 17 features in a numpy array, resulting in a 5000 x 17 array. I am trying to find the outliers for each feature using a Gaussian mixture, and I am rather confused about the following: 1) How many components should I use for my GaussianMixture? 2) Should I fit the GaussianMixture directly on the 5000 x 17 array, or to each feature column separately, resulting in 17 GaussianMixture models?

from sklearn import mixture

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)

or

clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)

or

clf = {}
for feature in range(full_feature_array.shape[1]):
    clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
    clf[feature].fit(full_feature_array[:, feature].reshape(-1, 1))
azal

2 Answers


Selecting the number of components to model a distribution with a Gaussian mixture model is an instance of model selection. This is not straightforward, and many approaches exist; a good summary can be found at https://en.m.wikipedia.org/wiki/Model_selection. One of the simplest and most widely used is cross validation.
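For instance, a minimal sketch of cross-validating n_components with scikit-learn's GridSearchCV (the candidate range and variable names are illustrative assumptions; GaussianMixture.score returns the average per-sample log-likelihood, which GridSearchCV maximizes on held-out folds):

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# full_feature_array: the 5000 x 17 matrix from the question
param_grid = {'n_components': range(1, 11)}  # illustrative candidate range
search = GridSearchCV(GaussianMixture(covariance_type='full'), param_grid, cv=5)
search.fit(full_feature_array)
print(search.best_params_)  # the count with the best held-out log-likelihood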

Normally, outliers can be identified as the points belonging to the component (or components) with the largest variance. This is an unsupervised approach, but it can still be difficult to decide what the cutoff variance should be.

A better approach (if applicable) is a supervised one: train the GMM on outlier-free data (with outliers removed manually), then classify as outliers any points with particularly low likelihood scores. A second supervised option is to train two GMMs (one for outliers and one for inliers, each using model selection) and perform two-class classification on new data, as sketched below.

Regarding your question about training univariate versus multivariate GMMs: it is difficult to say in general, but for outlier detection univariate GMMs (or, equivalently, multivariate GMMs with diagonal covariance matrices) may be sufficient, and they require fewer parameters than general multivariate GMMs, so I would start with that.
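A minimal sketch of the two-GMM idea, assuming you have already split training data into inlier and outlier sets (X_inliers, X_outliers, X_new and the component counts are illustrative assumptions):

from sklearn.mixture import GaussianMixture

# train one GMM per class; pick n_components via model selection in practice
gmm_in = GaussianMixture(n_components=2).fit(X_inliers)
gmm_out = GaussianMixture(n_components=1).fit(X_outliers)

# a new point is an outlier if the outlier model explains it better
is_outlier = gmm_out.score_samples(X_new) > gmm_in.score_samples(X_new)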

Toby Collins
  • Still, this answer doesn't quite answer my question: no supervised approach is possible here, and the answer is quite general. – azal Jan 09 '18 at 17:57
  • The reason why you don't (and won't ever) have a concrete single-solution answer with unsupervised outlier detection is that there exists no single best model selection method. I mentioned cross validation as it is a common one and easy to apply. My answer about detecting outliers as those which belong to the highest-variance modes is as good as you can do without any extra information. The cutoff point for deciding between outliers and inliers is a trade-off between false positive and false detect rates. You generally have to set this decision point yourself. – Toby Collins Jan 09 '18 at 18:13
  • Correction: "false detect" should be "false negative" – Toby Collins Jan 09 '18 at 18:36
  • And a final point: I am assuming that the outliers are more spread out in your vector space than the inliers. This is the case in most situations, but not all, so it depends on you having some prior knowledge of the inlier/outlier distributions. You can also consider outliers as points which are in some sense 'isolated', and try to detect these using unsupervised clustering such as k-means, then classify outliers as those which belong to clusters with fewer than N points. Again, without supervised training you have to guess at good values for K and N. – Toby Collins Jan 09 '18 at 19:28
  • If you are not satisfied with my answer you can try posting on Cross Validated, but I'm certain you will receive something similar. – Toby Collins Jan 16 '18 at 08:47

With a Gaussian Mixture Model (GMM), any point sitting in a low-density region can be considered an outlier. Perhaps the challenge is how to define a low-density region; for example, you can say that anything below the 4th percentile of density is an outlier.

import numpy as np

# gm is a fitted GaussianMixture; score_samples returns per-sample log-density
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]

Regarding choosing the number of components, look at the information criteria AIC and BIC computed for different numbers of components; they often agree in such cases. The lowest value is better.

gm.bic(X)
gm.aic(X)
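
For example, a short sketch of scanning candidate component counts and keeping the one with the lowest BIC (the range 1–10 is an illustrative assumption):

import numpy as np
from sklearn.mixture import GaussianMixture

# fit one GMM per candidate count and compare BIC; the lowest wins
candidate_ks = range(1, 11)
bics = [GaussianMixture(n_components=k).fit(X).bic(X) for k in candidate_ks]
best_k = candidate_ks[int(np.argmin(bics))]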

Alternatively, BayesianGaussianMixture gives (effectively) zero weight to clusters that are unnecessary.

from sklearn.mixture import BayesianGaussianMixture
bgm = BayesianGaussianMixture(n_components=8, n_init=10) # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)

output

array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])

So here the Bayesian GMM detected that there are three clusters.
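As a hypothetical follow-up, you could count the effective components from the fitted weights (the 0.01 cutoff is an assumption, not part of the API):

np.sum(bgm.weights_ > 0.01)  # number of clusters with non-negligible weight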

Areza