
I am interested in fitting a 2-component Gaussian Mixture Model to the data shown below.

[Figure: histogram of log-transformed count ratio data; values can't go above 0]

However, since what I am plotting here are log-transformed counts that were normalized to lie between 0 and 1, the maximum value my data will ever take is 0. When I try a naive fit using sklearn.mixture.GaussianMixture (code below), I get the resulting fit, which is obviously not what I want.

from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
import numpy as np

# start with some count data in (0,1]
logged_counts = np.log(counts)
model = GaussianMixture(2).fit(logged_counts.reshape(-1, 1))

# plot resulting fit
x_range = np.linspace(np.min(logged_counts), 0, 1000)
pdf = np.exp(model.score_samples(x_range.reshape(-1, 1)))
responsibilities = model.predict_proba(x_range.reshape(-1, 1))
pdf_individual = responsibilities * pdf[:, np.newaxis]

plt.hist(logged_counts, bins='auto', density=True, histtype='stepfilled', alpha=0.5)
plt.plot(x_range, pdf, '-k', label='Mixture')
plt.plot(x_range, pdf_individual, '--k', label='Components')
plt.legend()
plt.show()

[Figure: the resulting fit using a two-component GMM from sklearn]

I would love it if I could fix the mean of the top component at 0 and only optimize the other mean, the two variances, and the mixing fractions. (Additionally, I would love to be able to use a half-normal for the component on the right.) Is there a simple way to do this with built-in functions in python/sklearn, or will I have to build that model myself using some probabilistic programming language?

Benjamin Doughty

2 Answers


Afaik, you cannot do exactly what you want in sklearn.

Imho, there are basically four strategies: (i) implement the GMM yourself, (ii) switch to another language/framework, (iii) adapt the sklearn GMM code, or (iv) adapt your approach.


(i) You probably do not want to do this unless you want it as a learning exercise.


(ii) You could use Stan and adapt a standard mixture-model implementation so that one component is fixed to your choice of distribution type and parameters; see the sketch below.
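For instance, here is a minimal sketch of such a Stan model with the first component's mean pinned at 0 (the variable names, the <upper=0> bounds, and the commented-out cmdstanpy calls are my own illustration, not part of the original answer):

mixture_model = """
data {
  int<lower=1> N;
  vector<upper=0>[N] y;           // log-ratios, never above 0
}
parameters {
  real<lower=0, upper=1> theta;   // mixing fraction
  real<upper=0> mu2;              // free mean of the second component
  real<lower=0> sigma1;           // sd of the component fixed at mean 0
  real<lower=0> sigma2;
}
model {
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | 0, sigma1),     // mean fixed at 0
                      normal_lpdf(y[n] | mu2, sigma2));
}
"""

# compile and sample with cmdstanpy, roughly:
# from cmdstanpy import CmdStanModel
# with open("mixture.stan", "w") as f:
#     f.write(mixture_model)
# fit = CmdStanModel(stan_file="mixture.stan").sample(
#     data={"N": len(logged_counts), "y": logged_counts})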


(iii) You could do (i) but start from the sklearn code, or simply reuse its estimation methods with your own slight modifications.


(iv)

  • A Gaussian mixture model will not work here (as you mentioned) because you require a truncated normal distribution for the "first" (fixed) component.
  • If you do not need to fit the variance of the fixed component, then you can always just subtract the fixed component from the data (i.e. for each point, subtract the point's quantile value from the point's value).
  • If you don't mind some loss of precision in the estimates, you can make two passes (see the sketch after this list): first use a GMM to identify both components; then look only at the data from the component you want to be fixed and fit a truncated Gaussian to it (use .fit(data)); then subtract the resulting component from your original data (like in the previous option) and fit a GMM to the remainder to find the next component.
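A minimal sketch of that two-pass idea, assuming the logged_counts array from the question (using scipy.stats.halfnorm and a hard cluster assignment is my choice of illustration, not the only way to do it):

import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

# pass 1: rough GMM just to split the data into two clusters
gmm = GaussianMixture(2, random_state=0).fit(logged_counts.reshape(-1, 1))
labels = gmm.predict(logged_counts.reshape(-1, 1))

# treat the cluster whose mean is closest to 0 as the "fixed" component
fixed_idx = np.argmax(gmm.means_.ravel())
fixed_data = logged_counts[labels == fixed_idx]

# pass 2: fit a half-normal pinned at 0 to that cluster; the data are <= 0,
# so -fixed_data is half-normal with location 0, and floc=0 fixes the
# location so that only the scale is estimated
loc, scale = stats.halfnorm.fit(-fixed_data, floc=0)

# the remaining cluster can then be re-fit on its own, e.g. with a single
# Gaussian or another GMM pass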

Hope this helps :-)

Drey

SciPy (not sklearn) provides the possibility of fixing the mean (a.k.a. "location") for single distributions, as illustrated e.g. in this other answer. The way to do that is to pass the floc parameter (short for "fixed location") to a distribution's fit method.
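A minimal illustration of that (with synthetic data so the snippet is runnable):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.5, size=1000)

# floc=0 pins the location (the mean) at 0, so only the scale is estimated
loc, scale = stats.norm.fit(data, floc=0)
print(loc, scale)  # loc comes back as exactly 0.0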

However, as Drey mentioned, that is not possible for a GMM. If we take a closer look at the code, we can see that GaussianMixture extends BaseMixture, and the corresponding fit method runs an Expectation-Maximization algorithm that exposes no way to hold any component parameter fixed.

Adding this functionality to the existing code would involve heavy wrangling with the EM implementation, and it would probably cause more problems than it solves.

This said, it does seem that a GMM is not the best model for that kind of distribution. Eyeballing it, a mixture of beta distributions (fit to the 0-1 normalized counts, before the log transform) might do the trick; a rough sketch follows.
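As a hand-rolled sketch of that idea (my own illustration, not a library feature): a hard-assignment EM loop for a two-component beta mixture, assuming ratios holds the 0-1 normalized counts from the question before the log transform:

import numpy as np
from scipy import stats

def beta_mixture_em(x, n_iter=100, eps=1e-6):
    # keep the data strictly inside (0, 1) so the beta pdf stays finite
    x = np.clip(x, eps, 1 - eps)
    # crude initialization: split at the median
    w = np.array([0.5, 0.5])
    params = [stats.beta.fit(x[x <= np.median(x)], floc=0, fscale=1)[:2],
              stats.beta.fit(x[x > np.median(x)], floc=0, fscale=1)[:2]]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.column_stack([w[k] * stats.beta.pdf(x, *params[k])
                                for k in range(2)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the weights, then refit each beta on its
        # hard-assigned points (scipy has no weighted fit, so this is
        # only an approximation of true EM)
        w = resp.mean(axis=0)
        labels = resp.argmax(axis=1)
        params = [stats.beta.fit(x[labels == k], floc=0, fscale=1)[:2]
                  for k in range(2)]
    return w, params

# w, params = beta_mixture_em(ratios)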

Alternatively, the Python library pomegranate is very easy to use and allows you to fit mixtures of arbitrary distributions. You can check the list of supported distributions in its documentation; a beta distribution seems to be present:

https://pomegranate.readthedocs.io/en/latest/

Cheers!
Andres

    Thanks Andres, I will check out this library, it looks interesting. (Although it turns out that the naive GMM gets me ~90% of the way there, so I'm not sure how much I actually care about getting it exactly right haha...). Cheers Ben. – Benjamin Doughty Apr 07 '21 at 00:37
  • The error with exp distributions piles up greatly towards the tails. If you are dealing with central values that may very well be fine, but be careful with outliers – fr_andres Apr 07 '21 at 02:53