6

Running Python 3.7.3

I have made a simple GMM and fit it to some data. The predict_proba method returns 1's and 0's instead of the probabilities of the input belonging to each Gaussian.

I initially tried this on a bigger data set and then tried to get a minimum example.

from sklearn.mixture import GaussianMixture
import numpy as np
import pandas as pd

feat_1 = [1, 1.8, 4, 4.1, 2.2]
feat_2 = [1.4, .9, 4, 3.9, 2.3]
test_df = pd.DataFrame({'feat_1': feat_1, 'feat_2': feat_2})

gmm_test = GaussianMixture(n_components=2).fit(test_df)

gmm_test.predict_proba(test_df)
gmm_test.predict_proba(np.array([[8, -1]]))

I'm getting arrays that are just 1's and 0's, or almost (values around 10^-30).

Unless I'm interpreting something incorrectly, the return should be the probability of the point belonging to each Gaussian, so for example,

gmm_test.predict_proba(np.array([[8,-1]])) 

should certainly not be [1,0] or [0,1].

desertnaut
Dave
  • You can also try doing `GaussianMixture(n_components=2, covariance_type='diag')` to prevent overfitting. – user5054 May 27 '21 at 17:28

2 Answers

4

The example you gave produces these extreme results because you have only 5 data points yet are fitting 2 mixture components, which is essentially overfitting.

If you check the means and covariances of your components:

print(gmm_test.means_)
>>> [[4.05       3.95      ]
     [1.66666667 1.53333333]]

print(gmm_test.covariances_)
>>> [[[ 0.002501   -0.0025    ]
      [-0.0025      0.002501  ]]
     [[ 0.24888989  0.13777778]
      [ 0.13777778  0.33555656]]]

From this you can see that the first Gaussian is fitted with a very small covariance matrix, meaning that unless a point is very close to its center (4.05, 3.95), the probability of belonging to this Gaussian will always be negligible.
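To see where the saturated values come from, you can recompute what predict_proba does by hand: it is just each component's weighted Gaussian density, renormalized across components. A quick sketch (same data as the question; the exact fitted values vary between runs since no random_state is set):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Same data as in the question, as a plain array.
X = np.array([[1.0, 1.4], [1.8, 0.9], [4.0, 4.0], [4.1, 3.9], [2.2, 2.3]])
gmm = GaussianMixture(n_components=2).fit(X)

point = np.array([8.0, -1.0])
# Responsibility of component k: weight_k * N(x | mean_k, cov_k), renormalized.
densities = np.array([
    w * multivariate_normal.pdf(point, mean=m, cov=c)
    for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_)
])
resp = densities / densities.sum()
print(resp)  # same values as gmm.predict_proba([point])
```

Because the tight component's density at (8, -1) underflows to essentially zero while the broad component's is merely tiny, the renormalized ratio collapses to [0, 1] (or [1, 0], depending on component order).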

To convince yourself that, despite this, your model is working as expected, try this:

epsilon = 0.005    
print(gmm_test.predict_proba([gmm_test.means_[0]+epsilon]))
>>> array([[0.03142181, 0.96857819]])

As soon as you increase epsilon, it will only return array([[0., 1.]]), as you observed.
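You can also sweep the offset to watch the transition happen: right at the tight component's mean that component wins, a tiny step away the probabilities mix, and a larger step makes the tight density underflow entirely. A sketch (here the tight component is identified by the smaller covariance determinant, since component order is arbitrary; fitted values vary run to run):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.4], [1.8, 0.9], [4.0, 4.0], [4.1, 3.9], [2.2, 2.3]])
gmm = GaussianMixture(n_components=2).fit(X)

# The "tight" component is the one with the smallest covariance determinant.
tight = int(np.argmin([np.linalg.det(c) for c in gmm.covariances_]))
for eps in [0.0, 0.005, 0.05, 0.5]:
    p = gmm.predict_proba([gmm.means_[tight] + eps])[0]
    print(eps, p.round(6))
```

The probability assigned to the tight component goes from ~1 at its mean to ~0 once the offset exceeds a few of its (tiny) standard deviations.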

MaximeKan
    Oh! Never saw that I got a response. Thanks you. – Dave Aug 08 '19 at 16:00
  • feat_1 = [1,1.8,4,6.1, 2.2, 5, 7,9] feat_2 = [1.4,.9,4,12.9, 2.3, 5, 7, 9] test_df = pd.DataFrame({'feat_1': feat_1, 'feat_2': feat_2}) gmm_test = GaussianMixture(n_components =2 ).fit(test_df) gmm_test.predict_proba(test_df) gmm_test.predict_proba(np.array([[4,7]])) This also gives the same response, so I don't think that's the problem. Here, the covariance is not that small and the prediction point is in between the two gaussians. – Dave Aug 08 '19 at 22:20
0

It might be useful to know that increasing `reg_covar` will decrease the confidence:

 gmm_test = GaussianMixture(n_components=2, reg_covar=1).fit(test_df)
 # predict_proba now returns e.g. [[0.56079116 0.43920884]]
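The effect is easy to see by sweeping `reg_covar`: this value is added to each covariance matrix's diagonal during fitting, so larger values inflate the components and soften the predicted probabilities. A sketch (using the point [8, -1] from the question; exact numbers vary between runs):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.4], [1.8, 0.9], [4.0, 4.0], [4.1, 3.9], [2.2, 2.3]])
for reg in [1e-6, 1e-2, 1.0]:  # 1e-6 is sklearn's default
    gmm = GaussianMixture(n_components=2, reg_covar=reg).fit(X)
    print(reg, gmm.predict_proba([[8, -1]])[0].round(4))
```

With the default regularization the output saturates to 0/1; with reg_covar=1 both components keep a non-negligible probability. Note this masks the symptom rather than fixing the underlying issue (too few points for 2 components).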
Amir