
I need to fit a 2D Gaussian embedded in substantial uniform noise, as shown in the left plot below. I tried using sklearn.mixture.GaussianMixture with two components (code at the bottom), but this obviously fails, as shown in the right plot below.

[Figure: left panel, the generated data (Gaussian cluster in red over uniform noise in blue); right panel, the failed two-component GMM fit.]

I want to assign to each point a probability of belonging to the 2D Gaussian or to the uniform background noise. This seems like a simple enough task, but I've found no "simple" way to do it.

Any advice? It doesn't need to be a GMM; I'm open to other methods/packages.


import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture

# Generate 2D Gaussian data
N_c = 100
xy_c = np.random.normal((.5, .5), .05, (N_c, 2))

# Generate uniform noise
N_n = 1000
xy_n = np.random.uniform(.0, 1., (N_n, 2))

# Combine into a single data set
data = np.concatenate([xy_c, xy_n])

# fit a Gaussian Mixture Model with two components
model = mixture.GaussianMixture(n_components=2, covariance_type='full')
model.fit(data)
probs = model.predict_proba(data)
labels = model.predict(data)
# Separate the two clusters for plotting
msk0 = labels == 0
c0, p0 = data[msk0], probs[msk0].T[0]
msk1 = labels == 1
c1, p1 = data[msk1], probs[msk1].T[1]

# Plot
plt.subplot(121)
plt.scatter(*xy_n.T, c='b', alpha=.5)
plt.scatter(*xy_c.T, c='r', alpha=.5)
plt.xlim(0., 1.)
plt.ylim(0., 1.)

plt.subplot(122)
plt.scatter(*c0.T, c=p0, alpha=.75)
plt.scatter(*c1.T, c=p1, alpha=.75)
plt.colorbar()
# display predicted scores by the model as a contour plot
X, Y = np.meshgrid(np.linspace(0., 1.), np.linspace(0., 1.))
XX = np.array([X.ravel(), Y.ravel()]).T
Z = -model.score_samples(XX)
Z = Z.reshape(X.shape)
plt.contour(X, Y, Z)

plt.show()
Gabriel
  • In general this problem is pretty difficult. You'd need to make a model that mixes the Gaussian and uniform distributions, and then fit the parameters with something like the [EM algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm); a sketch of that idea follows these comments. But there might be a practically simpler method if you know more about your data. What are the "relative heights" of the noise and signal distributions? Do you know the SNR? Prior probabilities for any given point belonging to either class, or for the parameters of the Gaussian? Anything else? – bnaecker May 03 '20 at 17:43
  • No, I have no more information than the knowledge that there is only one Gaussian distribution hidden in the noise. – Gabriel May 03 '20 at 20:32
  • Have you taken a look at kernel density? https://scikit-learn.org/stable/auto_examples/neighbors/plot_species_kde.html#sphx-glr-auto-examples-neighbors-plot-species-kde-py – CoMartel May 05 '20 at 13:16
  • Yes, KernelDensity could be useful here. [Here's](https://stackoverflow.com/questions/41577705/how-does-2d-kernel-density-estimation-in-python-sklearn-work) an example use case – yatu May 05 '20 at 13:21
  • I'm not sure how a KDE would help me in this situation. I don't need to characterize the distribution of the sample using a sum of kernels; I need a way to separate the 2D Gaussian from the noise. – Gabriel May 05 '20 at 13:24
  • I'm not entirely familiar with it. But my thought is that you could use some tool to estimate the density of the 2D set of points, and run some clustering on it, turning it into a 3D problem perhaps. Just thinking out loud really, not sure if this is the right tool – yatu May 05 '20 at 13:29
  • Unfortunately that wouldn't work; see the comment in CoMartel's answer. – Gabriel May 05 '20 at 14:11
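
Following up on the EM suggestion in the first comment, here is a minimal sketch (an illustration, not a tested solution) of that idea: a two-component mixture of one 2D Gaussian and a uniform background, fit with EM. It assumes the data live on the unit square, so the uniform component has constant density 1.0; `fit_gauss_uniform` is a hypothetical helper name, not a library function.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Regenerate the data as in the question's snippet.
xy_c = np.random.normal((.5, .5), .05, (100, 2))
xy_n = np.random.uniform(0., 1., (1000, 2))
data = np.concatenate([xy_c, xy_n])

def fit_gauss_uniform(data, n_iter=200):
    """EM for p(x) = w * N(x; mu, cov) + (1 - w) * u(x), where u is the
    uniform density on [0, 1]^2 (constant 1.0, an assumption)."""
    w = 0.5                   # initial guess: half the points are signal
    mu = data.mean(axis=0)
    cov = np.cov(data.T)
    for _ in range(n_iter):
        # E-step: responsibility of the Gaussian component for each point.
        g = multivariate_normal.pdf(data, mean=mu, cov=cov)
        resp = w * g / (w * g + (1. - w) * 1.0)
        # M-step: re-estimate weight, mean and covariance from responsibilities.
        w = resp.mean()
        mu = resp @ data / resp.sum()
        diff = data - mu
        cov = (resp[:, None] * diff).T @ diff / resp.sum()
    return w, mu, cov, resp

w, mu, cov, resp = fit_gauss_uniform(data)
# resp[i] is the model's probability that point i belongs to the Gaussian.
print(w, mu)
```

With this model, `resp` gives the kind of per-point membership probability the question asks for, at the cost of hand-rolling the EM updates; the covariance update can degenerate if the Gaussian collapses onto a few points, so a real implementation would add some regularization.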

1 Answer


I think kernel density estimation can help you localize the Gaussian and exclude points outside of it (i.e., in areas with lower density).

Here is some example code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity


# Generate 2D Gaussian data
N_c = 100
xy_c = np.random.normal((.2, .2), .05, (N_c, 2))

# Generate uniform noise
N_n = 1000
xy_n = np.random.uniform(.0, 1., (N_n, 2))

# Combine into a single data set
data = np.concatenate([xy_c, xy_n])
print(data.shape)

# Fit a kernel density model; score_samples returns the log-density at each point
model = KernelDensity(kernel='gaussian', bandwidth=0.05)
model.fit(data)
probs = model.score_samples(data)

# Plot
plt.subplot(131)
plt.scatter(*xy_n.T, c='b', alpha=.5)
plt.scatter(*xy_c.T, c='r', alpha=.5)

plt.xlim(0., 1.)
plt.ylim(0., 1.)

# plot kernel score
plt.subplot(132)
plt.scatter(*data.T, c=probs, alpha=.5)

# display predicted scores by the model as a contour plot
X, Y = np.meshgrid(np.linspace(0., 1.), np.linspace(0., 1.))
XX = np.array([X.ravel(), Y.ravel()]).T
Z = -model.score_samples(XX)
Z = Z.reshape(X.shape)
plt.contour(X, Y, Z)
plt.xlim(0,1)
plt.ylim(0,1)

# plot kernel score with threshold
plt.subplot(133)
plt.scatter(*data.T, c=probs > 0.5, alpha=.5)  # adjust the threshold here
plt.colorbar()
plt.xlim(0, 1)
plt.ylim(0, 1)

plt.show()

And this is the output figure:

[Figure: three panels: the generated data, the kernel log-density scores with contours, and the thresholded scores.]

I changed the center of the Gaussian to check that my code was working. The right panel displays the kernel score with a threshold, which you can use in your case to filter out the noisy data outside of the Gaussian, but you can't filter the noise inside the Gaussian.

CoMartel
  • So with `probs > 0.5`, you directly use the resulting probability from fitting `KernelDensity` to assign each point to one "cluster" or another? – yatu May 05 '20 at 13:42
  • Yes, proba (it's not really a proba) is more related to the number of samples in an area, so it will be higher in the Gaussian area than in the noise area. Thresholding is a simple way to classify a sample as "in the Gaussian" or "in the noise", but you could directly use `proba` to get a less binary output, like what is displayed in the middle panel – CoMartel May 05 '20 at 13:46
  • In reality, `proba` here is the log-density of the data, so it behaves like a probability (higher is better), but it isn't a normalized probability per se. – CoMartel May 05 '20 at 13:51
  • I see, I'm just starting to familiarize myself with this :) This seems like a very convenient approach for this problem. Great answer! @comartel – yatu May 05 '20 at 13:52
  • Although this is an approach that *looks* like it works, it's not a proper solution to the original issue. It requires manual tuning of the parameters (not unsupervised like GMMs), it is unable to account for noise within the "cluster" region, it makes no use of the important knowledge that there is a clustered Gaussian embedded in *uniform* noise (not accounted for in the model), and it does not assign proper probabilities to the points of belonging to either distribution (Gaussian/noise). Still, thank you CoMartel for your answer! – Gabriel May 05 '20 at 13:58
  • This is a completely unsupervised model. The data points are not labeled. Sure, it requires tuning some parameters, but that is the case for almost all sklearn algorithms. It indeed does not account for noise within the cluster region, but how could any algorithm account for that? The distributions have been generated from different covariances, but once the data points are mixed up together, I can't see how they can be distinguished. @gabriel – yatu May 05 '20 at 14:16
  • Yes to all your remarks, you're right! As a complement: there are some methods you could try to automatically find a threshold (Otsu thresholding, for instance; see the sketch after these comments). Unfortunately, I don't think you can separate the noise from the signal in the Gaussian area (but I would be very interested if you find a way) – CoMartel May 05 '20 at 14:17
  • @yatu you are right about it being "unsupervised", I misspoke. The "proper" generalization of using a KDE as an unsupervised clustering method would be [MeanShift](https://spin.atomicobject.com/2015/05/26/mean-shift-clustering/). The main issue is the lack of accounting for the uniform noise, which is an important piece of information for the model. It could be accounted for by combining a Gaussian plus a uniform distribution into a single model, the same way GMMs combine several Gaussians, I believe. – Gabriel May 05 '20 at 14:27
  • Yes, agreed; if what you want is a probability that accounts for belonging to one or another distribution, this is not the right approach then @Gabriel – yatu May 05 '20 at 14:38
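
Regarding the Otsu suggestion above, here is a minimal sketch of picking the cutoff automatically instead of hard-coding `0.5`, rebuilding the `data` and fitted `KernelDensity` from the answer's code; whether Otsu's histogram-based split is appropriate for these log-densities is untested here.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from skimage.filters import threshold_otsu

# Rebuild the data and KDE from the answer's code.
data = np.concatenate([np.random.normal((.2, .2), .05, (100, 2)),
                       np.random.uniform(0., 1., (1000, 2))])
model = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(data)

# Pick a cutoff on the log-density scores with Otsu's method.
log_dens = model.score_samples(data)
cut = threshold_otsu(log_dens)
in_gaussian = log_dens > cut  # True for points in the high-density region
print(in_gaussian.sum(), "points above the Otsu cutoff")
```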