
I can see how to define a custom continuous random variable class in Scipy using the `rv_continuous` class, or a discrete one using `rv_discrete`.

But is there any way to create a random variable that is neither continuous nor discrete, such as the following one, which combines a normal distribution with a degenerate distribution that simply outputs 0?

import numpy as np
import matplotlib.pyplot as plt

def my_rv(a, loc, scale):
    # Return 0 with probability 1 - a, otherwise a draw from N(loc, scale)
    if np.random.random() > a:
        return 0
    else:
        return np.random.normal(loc=loc, scale=scale)

# Example
samples = np.array([my_rv(0.5,0,1) for i in range(1000)])
plt.hist(samples, bins=21, density=True)
plt.grid()
plt.show()
print(samples[:10].round(2))
# [0.   0.   0.   0.   0.   0.   2.2  0.07 0.   0.12]

The pdf of this random variable looks something like this:

[Figure: histogram of the samples, showing a tall spike at 0 on top of a normal bell curve]

[In my actual application `a` is quite small, e.g. 0.01, so this variable is 0 most of the time.]

Obviously, I can do this without Scipy, either with the simple function above or by generating samples from each distribution separately and then merging them appropriately. But I was hoping to make this a Scipy custom random variable and get all the benefits and features of that class.
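For what it's worth, the "generate samples separately and merge" approach can be vectorised with NumPy alone. A sketch (the name `my_rv_vectorized` is my own, and the probability convention matches `my_rv` above):

```python
import numpy as np

def my_rv_vectorized(a, loc, scale, size, rng=None):
    """Vectorised 'generate and merge': draw all the normals at once,
    then zero each one out with probability 1 - a, as in my_rv above."""
    rng = np.random.default_rng(rng)
    samples = rng.normal(loc=loc, scale=scale, size=size)
    samples[rng.random(size) > a] = 0.0
    return samples

samples = my_rv_vectorized(0.5, 0, 1, size=1000)
```

This gives sampling, but not the `pdf`/`cdf` machinery, which is the part the question is really after.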

Bill
  • What you're describing is called a mixture distribution. In the example you gave, it's a mixture of a normal distribution and a degenerate distribution which has all its mass at 0. Maybe scipy already has something to handle mixture distributions. By the way, your function `my_rv` doesn't seem right; I think you want the last line to be just `return np.random.normal(loc = loc, scale = scale)`, i.e. without multiplying by `a`. – Robert Dodier Nov 16 '20 at 03:11
  • I agree with @RobertDodier's comment that you shouldn't be multiplying by `a`. – Frank Yellin Nov 16 '20 at 03:29
  • Correct. Should not have multiplied the 2nd return value by `a`. Thank you. I have modified the code snippet. – Bill Nov 16 '20 at 03:46
  • Thanks @RobertDodier. Knowing now that these are called *mixture distributions* I was able to find the answer [here](https://stackoverflow.com/questions/47759577/creating-a-mixture-of-probability-distributions-for-sampling). Based on this, it seems like you can't model them completely in Scipy but instead you can model each distribution separately then do the final random-combination of the two when sampling. However, doesn't seem like this will work in my case since one of my distributions is degenerate. – Bill Nov 16 '20 at 03:54
  • Yeah, this answer: https://stackoverflow.com/a/47763145/871096 is similar to what I would do, but the code which is shown there assumes equal weight to each mixture component. I wrote some code in Java some time ago for handling mixtures among other things; maybe it could be an inspiration in some way. See: https://github.com/robert-dodier/riso and then Mixture.java in riso/src/riso/distributions. The fun part of mixture distributions is finding parameters for them; a standard approach is the so-called expectation-maximization algorithm. Let me know if you're interested in that. – Robert Dodier Nov 16 '20 at 04:02
  • I don't see that having a degenerate distribution will cause any trouble for sampling. Your function `my_rv` is already doing that. Maybe I am just not seeing what the problem is. – Robert Dodier Nov 16 '20 at 04:04
  • I tried to explain my reason for asking the question in the last paragraph. I was hoping to benefit from the many other built-in features of the `scipy.stats.rv_continuous` class such as the `pdf`, `cdf` methods and maybe others (i.e. not just generating samples). Before hand-coding it all I wanted to see if there is an existing package I could use. Maybe there is something other than Scipy? – Bill Nov 16 '20 at 04:09
  • what "benefits and features" are you hoping to get? why not solve the degeneracy first, e.g. use `normal(0, scale/100)` instead of just a delta-spike at zero. note that the bigger the difference in variance between the two components the more careful you have to be using the mixture, e.g. gradient descent will tend to get "stuck" and will have trouble escaping zero – Sam Mason Nov 16 '20 at 12:16
  • @SamMason Not sure yet! (I'm at the beginning of a research project where we will be modelling and analysing different types of these 'multimodal' signals so I just wanted to start off on the right foot by seeing what is possible in existing packages). One of the benefits of scipy.stats is the wide range of standard probability distributions at your disposal. Your idea of approximating the degenerate function is great. Thanks. Might use that... – Bill Nov 16 '20 at 19:25
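Following Sam Mason's suggestion, one way to get the `pdf`/`cdf`/`rvs` machinery the question asks for is to smear the point mass at 0 into a narrow normal of width `eps`, so the mixture becomes fully continuous and can be an `rv_continuous` subclass. A sketch (the class name `SpikeNormal` and the shape-parameter names are my own; `eps` controls how closely the spike approximates a true point mass):

```python
import numpy as np
from scipy import stats

class SpikeNormal(stats.rv_continuous):
    """Mixture of N(mu, sigma) with weight a and a narrow N(0, eps)
    spike with weight 1 - a, standing in for the point mass at 0."""
    def _argcheck(self, a, mu, sigma, eps):
        # The default argcheck requires all shapes > 0, which would reject mu = 0
        return (a >= 0) & (a <= 1) & (sigma > 0) & (eps > 0)
    def _pdf(self, x, a, mu, sigma, eps):
        return (1 - a) * stats.norm.pdf(x, 0, eps) + a * stats.norm.pdf(x, mu, sigma)
    def _cdf(self, x, a, mu, sigma, eps):
        return (1 - a) * stats.norm.cdf(x, 0, eps) + a * stats.norm.cdf(x, mu, sigma)

spike_normal = SpikeNormal(name='spike_normal')

# pdf, cdf, rvs, ppf, etc. now come from the rv_continuous machinery
p = spike_normal.cdf(0.0, 0.5, 0.0, 1.0, 0.01)
samples = spike_normal.rvs(0.5, 0.0, 1.0, 0.01, size=5)
```

Sampling via the generic `rvs` here is slow (it inverts the cdf numerically), so for bulk sampling the hand-rolled approaches below remain preferable.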

2 Answers


You could do something like:

import numpy as np

temp = np.random.normal(loc=loc, scale=scale, size=1000)
# Keep each draw with probability a; zero it out with probability 1 - a
mask = np.random.choice((0, 1), size=1000, p=(1 - a, a))
temp[mask == 0] = 0

So you start off with a normal distribution, and then you replace each element with 0 with probability 1 - a, matching the question's `my_rv`. (Note that I'm using Robert's correction above. If you think you really need to multiply by `a`, then do so.)

Alternatively, the last line could just be:

result = temp * mask

But this shortcut only works because the replacement value happens to be 0.

=== Edited ===

Change #1. In my call to np.random.choice, I left out the most important part! The p=(1 - a, a) argument, which gives you the right probabilities (zero with probability 1 - a, as in the question). Sorry about that. I must have deleted it after copying.

Change #2. I'm assuming that a is fairly close to 1, so that you need to calculate most of the normally distributed numbers anyway. (Calculating a normally distributed random number is pretty cheap.) But if a is close to 0, you might instead want:

result = np.random.choice((0.0, 1.0), size=1000, p=(1 - a, a))
mask = np.where(result != 0)
result[mask] = np.random.normal(loc=loc, scale=scale, size=len(mask[0]))

which only calculates the number of normally distributed values that you actually need.
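As a quick sanity check that the two variants agree, here is a sketch with concrete values filled in (`a = 0.1`, `loc = 0`, `scale = 1` are my own choices, since the snippets above leave them free):

```python
import numpy as np

a, loc, scale, n = 0.1, 0.0, 1.0, 100000

# Variant 1: draw every normal, then zero out with probability 1 - a
temp = np.random.normal(loc=loc, scale=scale, size=n)
mask = np.random.choice((0, 1), size=n, p=(1 - a, a))
temp[mask == 0] = 0

# Variant 2: only draw the normals that are actually kept
result = np.random.choice((0.0, 1.0), size=n, p=(1 - a, a))
idx = np.where(result != 0)
result[idx] = np.random.normal(loc=loc, scale=scale, size=len(idx[0]))

# Both should leave roughly a fraction a of the values non-zero
```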

Frank Yellin

I recently discovered there are other packages that can model more general types of stochastic processes such as this one.

I'm not sure this code is completely correct, but I think the above process can be represented in Pyro something like this:

import torch
import pyro
import pyro.distributions as dist

def process(a, loc, scale):
    # c is 1 with probability a: draw from the normal; otherwise output 0.
    # Branching keeps each trace to a single site named 'my_rv'; the
    # original list-based version sampled both branches and reused the
    # site name, which Pyro rejects.
    c = pyro.sample('c', dist.Bernoulli(a))
    if c == 1:
        return pyro.sample('my_rv', dist.Normal(loc, scale))
    return pyro.deterministic('my_rv', torch.tensor(0.0))

The documentation explains how Pyro handles discrete latent variables.

I haven't figured it all out yet, but I believe you can do a lot with Pyro, such as conditioning the model on data and running probabilistic inference.

Another package is PyMC3. I would be interested to know if there are others that would be good for this.

Bill