
Question was moved to stats.stackexchange

A Gaussian mixture model is typically fitted with the Expectation-Maximization (EM) algorithm.

This fairly simple algorithm consists of an initialization followed by two alternating steps.

  1. Initialization (for k=2 Gaussians) Find (or guess) initial mu and sigma parameter values for both Gaussians.

  2. E-step Use the current Gaussian parameters to estimate, for each data point, the likelihood that it comes from Gaussian A (or B). Dividing these likelihoods by their sum yields probabilities (posteriors) that a point comes from A rather than B. Prior information can be introduced here if the share of the underlying classes is known.

  3. M-step All mus and sigmas are updated. For example, the new mu for Gaussian A is the posterior-weighted average of all data points, where the weights are the posteriors from the previous step. The sigmas are updated similarly.
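The three steps above can be sketched for the one-dimensional, two-component case. This is a minimal NumPy illustration on made-up toy data; the initialization scheme and the fixed iteration count are my own choices, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 300 points from Gaussian A, 200 from Gaussian B
data = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(5.0, 0.5, 200)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# 1. Initialization: guess mu and sigma for both Gaussians
mu = np.array([data.min(), data.max()])
sigma = np.array([data.std(), data.std()])
pi = np.array([0.5, 0.5])                           # mixing weights (priors)

for _ in range(100):
    # 2. E-step: likelihood of each point under A and B, normalized
    #    to posteriors ("responsibilities")
    lik = pi * normal_pdf(data[:, None], mu, sigma)  # shape (n, 2)
    post = lik / lik.sum(axis=1, keepdims=True)

    # 3. M-step: posterior-weighted updates of mu, sigma (and pi)
    nk = post.sum(axis=0)
    mu = (post * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((post * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(data)
```

After convergence, `mu` and `sigma` should sit close to the generating parameters of the two toy components.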

I wonder how I could extend this algorithm to ignore a known share of the data points. For example, I have external knowledge that my dataset consists of p(A)=70%, p(B)=15% and another 15% of unknown species. I want to fit two Gaussians to describe the chestnut and sunflower distributions and allow the algorithm to ignore 15% of the data. This permits smaller sigmas, because the two Gaussians no longer need to be stretched to cover data that comes from another, unknown distribution.

Note: Since the unknown data possibly comprises multiple species, I cannot introduce a third Gaussian to "absorb" it. Also, the unknown species could be distributed similarly to A, so simply removing outliers is not necessarily the best option.

So far, I use the known shares of A and B as priors in the E-step, but I couldn't find an elegant way to ignore the unknown data. Any ideas? Thank you!
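One way to sketch this idea, along the lines suggested in the comments, is to add a third "component" that is a uniform background density with a fixed 15% weight, and update only the two Gaussian components in the M-step. The uniform background over the observed data range is an assumption for illustration, not something the question prescribes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data matching the stated shares: 70% A, 15% B, 15% unknown
data = np.concatenate([rng.normal(0.0, 1.0, 700),    # species A
                       rng.normal(5.0, 0.5, 150),    # species B
                       rng.uniform(-4, 9, 150)])     # unknown species

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Mixing weights fixed from external knowledge; the third component is a
# constant (uniform) density over the observed range that absorbs the 15%.
pi = np.array([0.70, 0.15, 0.15])
u = 1.0 / (data.max() - data.min())

mu = np.array([data.min(), data.max()])
sigma = np.array([data.std(), data.std()])

for _ in range(200):
    # E-step: posteriors under A, B and the uniform background
    lik = np.column_stack([pi[0] * normal_pdf(data, mu[0], sigma[0]),
                           pi[1] * normal_pdf(data, mu[1], sigma[1]),
                           pi[2] * np.full(len(data), u)])
    post = lik / lik.sum(axis=1, keepdims=True)

    # M-step: update only the Gaussian parameters; pi stays fixed,
    # so 15% of the posterior mass is "ignored" by the Gaussians
    nk = post[:, :2].sum(axis=0)
    mu = (post[:, :2] * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((post[:, :2] * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
```

Because the background soaks up points that neither Gaussian explains well, the fitted sigmas stay close to the generating values instead of being stretched over the contamination.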

Example with 70% species A, 15% species B and 15% unknown

Klops
  • Interesting question, but it's off topic here since it isn't a specific programming question; try stats.stackexchange.com instead. That said, you could do something like assuming a uniform distribution for the unknown species; I'm pretty sure the effect is going to be that a datum gets assigned to A or B if its probability under A or B is greater than a threshold, and assigned to "unknown" otherwise. – Robert Dodier Jun 29 '23 at 16:09
  • I don't know if widely-available mixture fitting functions allow for different kinds of distributions for the bumps. You might be interested in some code I wrote some time ago which is a pretty general implementation of mixture distribution fitting, in Java. See the `update` method in https://github.com/robert-dodier/riso/blob/master/src/riso/distributions/Mixture.java which references a paper by Ormoneit and Tresp which might be interesting to you. See also my dissertation, https://riso.sourceforge.net/docs/dodier-dissertation.pdf which presents the ideas which are embodied in the code. – Robert Dodier Jun 29 '23 at 16:18
  • I checked the maths on how the update formula is derived for the GMM; I guess this can be done for other distributions, too. Thanks for your dissertation, since I'm halfway through mine, I enjoyed looking through parts of it! The implementation as such isn't hard, I have implemented it for the multivariate case with tensorflow distributions which allow the handling of distributions in batches, which makes it very neat. I will carry this question to stackexchange then, thank you for your input – Klops Jun 29 '23 at 22:58

0 Answers