
I want to fit a Gaussian mixture model to a set of weighted data points using Python.

I tried sklearn.mixture.GMM(), which works fine except for the fact that it weights all data points equally. Does anyone know a way to assign weights to the data points in this method? I tried duplicating data points several times to "increase their weight", but this seems impractical for large datasets.
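To make the duplication idea concrete, here is a minimal sketch (assuming integer weights and toy data made up for the example; it uses the old sklearn.mixture.GMM API from above, which also exposes a min_covar floor on the covariance):

import numpy as np
from sklearn.mixture import GMM   #deprecated in later sklearn versions (replaced by GaussianMixture)

X = np.random.RandomState(0).normal(size=(100, 2))      #toy data, one point per row
w = np.random.RandomState(1).randint(1, 5, size=100)    #integer weight per point

X_repeated = np.repeat(X, w, axis=0)                    #each point appears w[i] times
gmm = GMM(n_components=2, min_covar=1e-3).fit(X_repeated)
print(gmm.means_)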

I also thought about implementing the EM algorithm myself, but this seems to be much slower than e.g. the GMM method above and would drastically increase the computation time for large datasets.

I just discovered the OpenCV method for the EM algorithm, cv2.EM(). This again works fine but has the same problem as sklearn.mixture.GMM, and additionally there seems to be no way to change the minimum value allowed for the covariance. Or is there a way to change the covariance minimum to e.g. 0.001? I hoped it would be possible to use the probs parameter to assign the weights to the data, but it seems to be just an output parameter and has no influence on the fitting process, does it? Using probs0 and starting the algorithm with the M-step via trainM didn't help either. For probs0 I used a (number of data points) x (number of GMM components) matrix whose columns are identical, with each data point's weight written into the corresponding row. This didn't solve the problem either; it just resulted in a mixture model where all means were 0.

Does anyone have an idea how to manipulate the methods above or does anyone know another method so that the GMM can be fitted with weighted data?

  • GMM can easily be extended to support weights, but you will probably need to modify an implementation for this. I'd go with a Java one such as ELKI: pure Python is too slow, Cython is not easy to begin with, and C requires a lot of debugging experience. Java is easier and gives performance only slightly worse than C. But what do you mean by "minimum covariance" - why would 0 covariance be bad, and what about negative covariance? – Has QUIT--Anony-Mousse Apr 05 '16 at 21:44
  • 0 covariance is bad because it causes an infinite likelihood, so a model where a mean is placed exactly on a data point with 0 covariance would achieve the best fit result (maximum likelihood), even though it is definitely not the "correct" solution to describe the data and not what is wanted. Additionally, I want to post-process the result, so it would be nice to be able to set the minimum covariance myself. – JaneD Apr 06 '16 at 07:49
  • No, as long as you have variance; covariance is correlation. – Has QUIT--Anony-Mousse Apr 06 '16 at 07:52
  • Think I found a way that works quite nicely: it is possible to access the code of the class sklearn.mixture.GMM, and there the weighting of the data is simply introduced by modifying the function _do_mstep (and of course the functions fit and _fit, to make sure _do_mstep gets the weights correctly). One simply needs to multiply the responsibilities by the data weights, and that's it :-) (a sketch of this idea follows these comments). – JaneD Apr 11 '16 at 12:38
  • @JaneD, could you please publish the code somewhere? (still working, right?) I'm struggling with the same problem here. – manu May 02 '16 at 10:47
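A minimal NumPy sketch of the weighted-EM idea described in JaneD's comment above: the per-point data weights are multiplied into the responsibilities before the M-step, and a covariance floor (min_covar) addresses the "minimum covariance" concern from the question. All names here are made up for illustration; this is not sklearn's actual _do_mstep code.

import numpy as np

def weighted_gmm_em(X, w, n_components, n_iter=100, min_covar=1e-3, seed=0):
    #X: (n, d) data matrix, w: (n,) per-point weights
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, n_components, replace=False)]           #init means from random points
    covs = np.array([np.cov(X.T) + min_covar * np.eye(d)] * n_components)
    mix = np.full(n_components, 1.0 / n_components)                 #mixing proportions
    for _ in range(n_iter):
        #E-step: responsibilities resp[i, k] proportional to mix[k] * N(X[i]; means[k], covs[k])
        resp = np.empty((n, n_components))
        for k in range(n_components):
            diff = X - means[k]
            inv = np.linalg.inv(covs[k])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[k]))
            resp[:, k] = mix[k] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        #Weighted M-step: multiply the responsibilities by the per-point data weights
        wresp = resp * w[:, None]
        nk = wresp.sum(axis=0)
        mix = nk / nk.sum()
        means = (wresp.T @ X) / nk[:, None]
        for k in range(n_components):
            diff = X - means[k]
            covs[k] = (wresp[:, k, None] * diff).T @ diff / nk[k] + min_covar * np.eye(d)
    return mix, means, covs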

2 Answers


Taking Jacob's suggestion, I coded up a pomegranate implementation example:

import pomegranate
import numpy
import sklearn
import sklearn.datasets 

#-------------------------------------------------------------------------------
#Get data from somewhere (moons data is nice for examples)
Xmoon, ymoon = sklearn.datasets.make_moons(200, shuffle = False, noise=.05, random_state=0)
Moon1 = Xmoon[:100] 
Moon2 = Xmoon[100:] 
MoonsDataSet = Xmoon

#Weight the data from moon2 much higher than moon1:
MoonWeights = numpy.array([numpy.ones(100), numpy.ones(100)*10]).flatten()

#Make the GMM model using pomegranate
model = pomegranate.gmm.GeneralMixtureModel.from_samples(
    pomegranate.MultivariateGaussianDistribution,   #Either single function, or list of functions
    n_components=6,     #Required if single function passed as first arg
    X=MoonsDataSet,     #data format: each row is a point-coordinate, each column is a dimension
    )

#Force the model to train again, using additional fitting parameters
model.fit(
    X=MoonsDataSet,         #data format: each row is a coordinate, each column is a dimension
    weights = MoonWeights,  #List of weights. One for each point-coordinate
    stop_threshold = .001,  #Lower this value for a better fit, at the cost of longer training
                            #   (sklearn defaults to tighter/slower fits than pomegranate)
    )

#Wrap the model object into a probability density python function 
#   f(x_vector)
def GaussianMixtureModelFunction(Point):
    return model.probability(numpy.atleast_2d( numpy.array(Point) ))

#Plug in a single point to the mixture model and get back a value:
ExampleProbability = GaussianMixtureModelFunction( numpy.array([ 0,0 ]) )
print('ExampleProbability', ExampleProbability)
  • https://stackoverflow.com/questions/67137982/how-to-assure-that-the-covariance-matrices-are-all-positive-definite-in-pomigran @D Adams – iforcebd Apr 18 '21 at 12:59

If you're still looking for a solution, pomegranate now supports training a GMM on weighted data. All you need to do is pass in a vector of weights at training time and it will handle it for you (a minimal sketch follows the links below). Here is a short tutorial on GMMs in pomegranate!

The parent github is here:

https://github.com/jmschrei/pomegranate

The specific tutorial is here:

https://github.com/jmschrei/pomegranate/blob/master/tutorials/B_Model_Tutorial_2_General_Mixture_Models.ipynb
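A minimal sketch of passing per-sample weights at training time, with toy data made up for the example (it assumes a pomegranate 0.x-style API; whether from_samples itself accepts a weights argument should be checked against the installed version - passing weights to fit, as in the other answer, works as well):

import numpy
import pomegranate

X = numpy.random.RandomState(0).normal(size=(500, 2))        #toy 2-D data, one point per row
w = numpy.random.RandomState(1).uniform(0.1, 10, size=500)   #one weight per data point

model = pomegranate.gmm.GeneralMixtureModel.from_samples(
    pomegranate.MultivariateGaussianDistribution,
    n_components=3,
    X=X,
    weights=w,          #per-sample weights used during EM (assumed supported here)
    )
print(model.probability(numpy.atleast_2d([0.0, 0.0])))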

  • I edited to fix the broken link. I still can't find the concrete example with weights of the DATA, instead of weights on the mixture components. – D A Jun 13 '20 at 22:17