
I would like to implement a function in Python (using NumPy) that takes a mathematical function (e.g. p(x) = e^(-x), as below) as input and generates random numbers distributed according to that function's probability distribution. I also need to plot them, so we can see the distribution.

I actually need a random number generator for exactly the following two mathematical functions as input, but if it could take other functions, why not:

1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)

Does anyone have any idea how this is doable in python?

ZelelB
  • Numpy has a number of distribution methods built in: https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html#distributions – Matthew Story Jun 26 '18 at 20:13
  • Second one should be valid over all numbers, since the normalization constant is 2pi – C.Nivs Jun 26 '18 at 20:14
  • [This](http://code.activestate.com/recipes/577264-random-numbers-with-arbitrary-probability-distribu/) link and the first comment under that post might help. – MoxieBall Jun 26 '18 at 20:15
    *"...that takes a `f(x)` as input ..."* What is `f(x)`, and what does it have to do with the two functions `p(x)` that you show? – Warren Weckesser Jun 26 '18 at 20:40
  • @WarrenWeckesser I meant "a function" (not f(x)) as argument (input). The two p(x)'s are the mathematical functions that I want to use the Python function for, which are as explained above: ".... and generates random numbers, that are distributed according to the following probability distribution function:" But you are right, it is a bit ambiguous, and the two p(x)'s should have been named differently. – ZelelB Jun 27 '18 at 08:59

2 Answers


For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.

For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.

The simplest, and maybe the easiest to understand, variant of these algorithms is Metropolis sampling.

The basic idea goes like this:

  • start from a random point x and take a random step xnew = x + delta
  • evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew)
  • if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
  • if the new point is less probable, randomly decide whether to accept or reject, depending on how probable¹ the new point is
  • take a new step from this point and repeat the cycle

It can be shown, see e.g. Sokal², that points sampled with this method follow the target probability distribution.

An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.

Example implementation

Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.

import numpy as np

def uniform_proposal(x, delta=2.0):
    # propose a random neighbour uniformly within +/- delta of x
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1 # start somewhere

    for i in range(nsamples):
        trial = proposal(x) # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)

        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial

        yield x

Let's see if it works with some simple distributions

Gaussian mixture

def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))

[Plot: histogram of the Metropolis samples from the Gaussian mixture]
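The plots in this answer are just histograms of the generated samples. As noted in the comments below, a minimal sketch to reproduce them, assuming matplotlib is available:

import matplotlib.pyplot as plt

# a histogram of the samples approximates the target density
plt.hist(samples, bins=100)
plt.show()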

Cauchy

def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))

p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))

[Plot: histogram of the Metropolis samples from the Cauchy distribution]

Arbitrary functions

You don't really have to sample from proper probability distributions; you may just need to enforce a limited domain in which to sample your random steps³ (one way to do this is sketched below).
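The `domain` keyword used in the calls below is not part of the sampler defined above; note 3 leaves the implementation out. Purely as an illustration of that note's suggestion, here is one possible variant that treats the target as zero outside the domain, by simply rejecting trial steps that leave it:

def metropolis_sampler(p, nsamples, proposal=uniform_proposal,
                       domain=(-np.inf, np.inf)):
    lo, hi = domain
    # start at a random point inside the domain (assumes p > 0 there)
    x = np.random.uniform(lo, hi) if np.isfinite(lo) and np.isfinite(hi) else 1.0

    for i in range(nsamples):
        trial = proposal(x)
        # a trial step outside the domain has zero target probability: reject it
        if lo <= trial <= hi and np.random.uniform() < p(trial)/p(x):
            x = trial
        yield x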

p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))

[Plot: histogram of the Metropolis samples from p(x) = sqrt(x) on (0, 10)]

p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))

[Plot: histogram of the Metropolis samples from (sin(x)/x)^2 on (-4*pi, 4*pi)]

Conclusions

There is still far too much to say about proposal distributions, convergence, correlation, efficiency, applications, the Bayesian formalism, other MCMC samplers, etc. I don't think this is the proper place for it, and there is plenty of material available online that is much better than anything I could write here.


  1. The idea here is to favor exploration where the probability is higher, while still looking at low-probability regions, as they might lead to other peaks. The choice of the proposal distribution, i.e. how you pick new points to explore, is fundamental: steps that are too small might constrain you to a limited area of your distribution, while steps that are too big could lead to very inefficient exploration.

  2. Physics oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder for beginners to grasp. There are plenty of tutorials available online; see e.g. this one from Duke University.

  3. Implementation not shown in the main sampler to avoid adding confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain (the variant sketched above rejects steps that leave the domain, which is equivalent to the latter).

filippo
  • thanks for the great answer! Just a question: how can I plot the results as you did to get those plots? Could you please add the code used for plotting? And what is burn-in? Thx! – ZelelB Jun 27 '18 at 14:56
  • The first samples in a Markov chain are usually highly correlated, so it's good practice to discard them, especially if you need just a few points. Burn-in is the number of samples to drop. My plots are histograms done with `plt.hist(samples, bins=100)`. – filippo Jun 27 '18 at 15:04
  • @ZelelB by the way, if you just need exponentials, normals and other simple distributions, Metropolis is not the way to go. Using NumPy samplers is a lot safer than baking your own like this, especially as I left out a lot of details about writing good Monte Carlo code (there are lots of pitfalls to be aware of when working with random numbers). – filippo Jun 27 '18 at 15:08
  • actually burn-in is not really needed here, I'll remove it as it adds unnecessary confusion – filippo Jun 27 '18 at 15:18
  • Nm, I misunderstood the problem; this approach is fine. – Robert Dodier Jun 27 '18 at 16:48
  • @filippo I think your solution is quite what I need, because it is an assignment, and we cannot use the predefined numpy & scipy libraries which are going to get it done in one line. But I am skeptical about the metropolis. I don't know if I have to use it. – ZelelB Jun 27 '18 at 17:07
  • @ZelelB I thought so. If you didn't cover Monte Carlo methods in the course, then Metropolis is not what you're supposed to use. You most probably have to use some CDF inversion method to transform uniform samples (see the bottom of Olivier's answer). For normal distributions you can use Box-Muller, which is really easy to implement (see e.g. [this tutorial](https://glowingpython.blogspot.com/2013/01/box-muller-transformation.html)). There are plenty of other well-known transformations you could use; Cauchy samples can be generated from normal pairs, for example. – filippo Jun 27 '18 at 17:11
  • But the thing is: Ok, for the first function, I got it, it is a gaussian distribution, and setting a=1/2pi , mu=0 and sigma=1 you get it. But what is the second function? How to implement it like you did with the gaussian one? – ZelelB Jun 27 '18 at 17:24
  • @filippo the solution with the domain isn't working. It is throwing an exception, as the domain isn't in the metropolis_sampler's signature. Do you know how I can fix it? – ZelelB Jun 27 '18 at 18:17
  • @ZelelB there's a note about domain; the easiest way is to define your `p` function so that it returns `0` outside the domain you want. I believe there's plenty of material here, between Olivier's answer, mine, and the comments, to get you started on your assignment – filippo Jun 28 '18 at 08:29
  • @filippo thank you so much for the great answer! Just two last questions: what is the delta=2.0 in the uniform_proposal, mathematically? Is it the step of the walk that we try, or what exactly? And what are the values on the y-axis of the plots, for example the (2k, 4k, 6k, 8k) on your last example plot? Is it the number of times the generator ran? I mean the number of iterations? – ZelelB Jun 29 '18 at 16:27
  • @ZelelB it's a histogram: you divide your domain into a number of bins and count how many samples each bin contains, so you have `bins` for `x` and `counts` for `y`. `delta` is the maximum distance your walker can explore at each step. The `2.0` there is quite arbitrary; there are theoretical considerations that relate it to the variance of the peak you're exploring, but for simple use cases you just need it to be empirically big enough for the walker not to get trapped in a high-probability region and to properly explore all the space – filippo Jun 29 '18 at 19:57
  • Thank you so much for all this help! Very helpful and such a great answer! – ZelelB Jul 07 '18 at 16:45
  • This is a fantastic guide. I'm looking for a way to store distributions as objects based on some starting data. These objects should be able to be added together too. My thought was to estimate and store a bandwidth parameter, and use it with a KDE which could be used to generate data as per your answer. If two distributions need to be added, then I just add their two KDEs with their two stored bandwidth estimates, and sample from that. Or create a new bandwidth estimate from concatenating a sampling of both. Unfortunately I have to do all this in Java :( – rocksNwaves Aug 26 '21 at 22:26

NumPy offers a wide range of probability distributions.

The first function is an exponential distribution with parameter 1.

np.random.exponential(1)

The second one is a normal distribution with mean 0 and variance 1.

np.random.normal(0, 1)

Note that in both cases the arguments are optional, as these are the default values for these distributions.

As a sidenote, you can also find those distributions in the built-in random module, as random.expovariate and random.gauss respectively.
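To also plot the samples, as the question asks, you can draw many values at once with the `size` argument and histogram them. A minimal sketch, assuming matplotlib is available:

import numpy as np
import matplotlib.pyplot as plt

exp_samples = np.random.exponential(1, size=100000)   # p(x) = e^(-x), x >= 0
norm_samples = np.random.normal(0, 1, size=100000)    # standard normal g(x)

# histograms approximate the underlying densities
plt.hist(exp_samples, bins=100, density=True)
plt.hist(norm_samples, bins=100, density=True)
plt.show()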

More general distributions

While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function (CDF) of your distribution and feed it values from a uniform distribution.

inverse_cdf(np.random.uniform())

For example, if NumPy did not provide the exponential distribution, you could do this.

def exponential():
    # inverse CDF of the exponential distribution: -ln(1 - u)
    return -np.log(1 - np.random.uniform())
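The normal distribution's CDF has no closed-form inverse in elementary functions, but the Box-Muller transform mentioned in the comments on filippo's answer converts a pair of uniform samples into a standard normal sample. A minimal sketch (the name box_muller is just for illustration):

def box_muller():
    # two independent uniforms; 1 - u keeps the argument of log in (0, 1]
    u1 = 1 - np.random.uniform()
    u2 = np.random.uniform()
    return np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)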

If you encounter a distribution whose CDF is not easy to invert, then consider filippo's great answer.

Olivier Melançon
  • The first one is most likely the exponential distribution with `beta=1`, and should have the domain restricted to [0, inf), in which case it is a true probability distribution – ALollz Jun 26 '18 at 20:25
  • @filippo I am not familiar with that algorithm, if you were to provide an example as answer I would be interested – Olivier Melançon Jun 26 '18 at 20:55
  • @OlivierMelançon thank you for your detailed answer! But I don't really get how to use that to implement my Python function, so that it takes for example the given p(x) or g(x) as input, and generates random numbers distributed according to them. And is it possible to plot the result then, so that there is a visual explanation? – ZelelB Jun 27 '18 at 09:22
  • I think I need to use something like this: https://en.wikipedia.org/wiki/Normal_distribution#Generating_values_from_normal_distribution, but I am not sure how to implement it! – ZelelB Jun 27 '18 at 11:43
  • @ZelelB Have a look at filippo's great answer. Using sympy you could convert the strings to functions, then use Metropolis sampling. – Olivier Melançon Jun 27 '18 at 13:11