15

Is there a general way to join SciPy (or NumPy) probability distributions to create a mixture probability distribution which can then be sampled from?

I have such a distribution for display using something like:

mixture_gaussian = (norm.pdf(x_axis, -3, 1) + norm.pdf(x_axis, 3, 1)) / 2

which if then plotted looks like:

double gaussian

However, I can't sample from this generated model, as it's just a list of points which will plot as the curve.

Note, this specific distribution is just a simple example. I'd like to be able to generate several kinds of distributions (including "sub"-distributions which are not just normal distributions). Ideally, I would hope there would be someway for the function to be automatically normalized (i.e. not having to do the / 2 explicitly as in the code above.

Does SciPy/NumPy provide some way of easily accomplishing this?

This answer provides a way that such a sampling from a multiple distributions could be done, but it certainly requires a bit of handcrafting for a given mixture distribution, especially when wanting to weight different "sub"-distributions differently. This is usable, but I would hope for method that's a bit cleaner and straight forward if possible. Thanks!

Jenny Shoars
  • 994
  • 3
  • 16
  • 40

4 Answers4

24

Sampling from a mixture of distributions (where PDFs are added with some coefficients c_1, c_2, ... c_n) is equivalent to sampling each independently, and then, for each index, picking the value from k-th sample, with probability c_k.

The latter, mixing, step can be efficiently done with numpy.random.choice. Here is an example where three distributions are mixed. The distributions are listed in distributions, and their coefficients in coefficients. There is a fat normal distribution, a uniform distribution, and a narrow normal distribution, with coefficients 0.5, 0.2, 0.3. The mixing happens at data[np.arange(sample_size), random_idx] after random_idx are generated according to given coefficients.

import numpy as np
import matplotlib.pyplot as plt

distributions = [
    {"type": np.random.normal, "kwargs": {"loc": -3, "scale": 2}},
    {"type": np.random.uniform, "kwargs": {"low": 4, "high": 6}},
    {"type": np.random.normal, "kwargs": {"loc": 2, "scale": 1}},
]
coefficients = np.array([0.5, 0.2, 0.3])
coefficients /= coefficients.sum()      # in case these did not add up to 1
sample_size = 100000

num_distr = len(distributions)
data = np.zeros((sample_size, num_distr))
for idx, distr in enumerate(distributions):
    data[:, idx] = distr["type"](size=(sample_size,), **distr["kwargs"])
random_idx = np.random.choice(np.arange(num_distr), size=(sample_size,), p=coefficients)
sample = data[np.arange(sample_size), random_idx]
plt.hist(sample, bins=100, density=True)
plt.show()

histogram

11

Following @PaulPanzer's pointer in the comments, I created the following subclass for easily creating mixture models from the SciPy distributions. Note, the pdf is not required for my question, but it was nice for me to have.

class MixtureModel(rv_continuous):
    def __init__(self, submodels, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.submodels = submodels

    def _pdf(self, x):
        pdf = self.submodels[0].pdf(x)
        for submodel in self.submodels[1:]:
            pdf += submodel.pdf(x)
        pdf /= len(self.submodels)
        return pdf

    def rvs(self, size):
        submodel_choices = np.random.randint(len(self.submodels), size=size)
        submodel_samples = [submodel.rvs(size=size) for submodel in self.submodels]
        rvs = np.choose(submodel_choices, submodel_samples)
        return rvs

mixture_gaussian_model = MixtureModel([norm(-3, 1), norm(3, 1)])
x_axis = np.arange(-6, 6, 0.001)
mixture_pdf = mixture_gaussian_model.pdf(x_axis)
mixture_rvs = mixture_gaussian_model.rvs(10)
Jenny Shoars
  • 994
  • 3
  • 16
  • 40
  • 1
    I thought you wanted to "weight different "sub"-distributions differently"; here they all get the same weight. –  Dec 12 '17 at 00:02
  • @CrazyIvan, that's correct. I will eventually want them to be able to be weighted differently. For the moment what I wrote is sufficient, but eventually I will have to change my random selections (and the scaling of the `pdf`) to something closer to what you did in your answer. – Jenny Shoars Dec 12 '17 at 21:59
  • @Jenny Shoars, I’m having a similar problem to yours. However, I’d be interested in weighting the input sub-distributions differently. Can you give me a hint how this could be implemented (especially for_pdf)? Btw, I had to replace the following line in your code `super().__init__(*args, **kwargs)` with `super(MixtureModel, self).__init__(*args, **kwargs)` in order to avoid an error message. – Oliver Angelil Mar 11 '18 at 22:30
3

Similar to other commenters on Jenny Shoars answer, I needed the weights to be uneven and also wanted to be able to look at more than the pdf.

I augmented her approach and extended the class so you can specify the weights, and also provide cdf and sf in addition pdf and rvs.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt 

class MixtureModel(stats.rv_continuous):
    def __init__(self, submodels, *args, weights = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.submodels = submodels
        if weights is None:
            weights = [1 for _ in submodels]
        if len(weights) != len(submodels):
            raise(ValueError(f'There are {len(submodels)} submodels and {len(weights)} weights, but they must be equal.'))
        self.weights = [w / sum(weights) for w in weights]
        
    def _pdf(self, x):
        pdf = self.submodels[0].pdf(x) * self.weights[0]
        for submodel, weight in zip(self.submodels[1:], self.weights[1:]):
            pdf += submodel.pdf(x)  * weight
        return pdf
            
    def _sf(self, x):
        sf = self.submodels[0].sf(x) * self.weights[0]
        for submodel, weight in zip(self.submodels[1:], self.weights[1:]):
            sf += submodel.sf(x)  * weight
        return sf

    def _cdf(self, x):
        cdf = self.submodels[0].cdf(x) * self.weights[0]
        for submodel, weight in zip(self.submodels[1:], self.weights[1:]):
            cdf += submodel.cdf(x)  * weight
        return cdf

        

    def rvs(self, size):
        submodel_choices = np.random.choice(len(self.submodels), size=size, p = self.weights)
        submodel_samples = [submodel.rvs(size=size) for submodel in self.submodels]
        rvs = np.choose(submodel_choices, submodel_samples)
        return rvs

mixture_model = MixtureModel([stats.norm(-3, 1), 
                              stats.norm(3, 1), 
                              stats.uniform(loc=3, scale = 2)],
                             weights = [0.3, 0.5, 0.2])

Giving

x_axis = np.arange(-6, 6, 0.001)
plt.plot(x_axis, mixture_model.sf(x_axis), label = 'SF')
plt.plot(x_axis, mixture_model.cdf(x_axis), label = 'CDF')
plt.plot(x_axis, mixture_model.pdf(x_axis), label = 'PDF')

plt.hist(mixture_model.rvs(10**5), bins = 50, density = True, label = 'Sampled')
plt.legend()

plots

ramzeek
  • 2,226
  • 12
  • 23
0

The code below stores a 1000 samples from N(0,1) and 500 samples from N(7,2) in an array that can then be sampled from.

import numpy as np
from scipy import stats

d = np.concatenate((stats.norm.rvs(0.0, 1.0, 1000), stats.norm.rvs(7.0, 2.0, 500)))
np.random.choice(d, 3)  # sample 3 observations

Mixture components other than the Normal can be used (e.g., stats.poisson) and there can be arbitrary number of those.

ubiq
  • 1