33

The random module (http://docs.python.org/2/library/random.html) has several fixed functions to randomly sample from. For example random.gauss will sample random point from a normal distribution with a given mean and sigma values.

I'm looking for a way to extract a number N of random samples between a given interval using my own distribution as fast as possible in python. This is what I mean:

def my_dist(x):
    # Some distribution, assume c1,c2,c3 and c4 are known.
    f = c1*exp(-((x-c2)**c3)/c4)
    return f

# Draw N random samples from my distribution between given limits a,b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)

where ran_func_sample is what I'm after and a, b are the limits from which to draw the samples. Is there anything of that sort in python?

Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • You can just call your function N times. However, you still need to specify what distribution you want your `x` values to be chosen from. – BrenBarn Jan 13 '14 at 20:28
  • My distribution is my function. I need to evaluate that function randomly N times between a certain interval. – Gabriel Jan 13 '14 at 20:30
  • 2
    Your function isn't a distribution. You need to decide what the distribution is on the arguments you call it with. If you want to pass it N random values "between a certain interval", where are you specifying that interval in your code example? Do you want the random `x` values to be chosen uniformly from that interval, or in some other way? – BrenBarn Jan 13 '14 at 20:32
  • I forgot to specify the interval, I'll add it to the code. You are right, I explained myself poorly giving a `x**2` function and not a distribution. I'll try to fix that now. – Gabriel Jan 13 '14 at 20:34
  • @BrenBarn is it better now? – Gabriel Jan 13 '14 at 20:45
  • 1
    I have such code for discrete distributions. Everything can be approximated with a discrete distribution, and it makes things a lot simpler (though still nontrivial, to get numerical robustness). If that helps you I could wrap it up. – Eelco Hoogendoorn Jan 13 '14 at 21:06
  • @EelcoHoogendoorn that would definitely be of help, even if it is a first step in the right direction. If you would be kind enough to share your code, I'd greatly appreciate it! – Gabriel Jan 13 '14 at 21:09

5 Answers5

23

You need to use Inverse transform sampling method to get random values distributed according to a law you want. Using this method you can just apply inverted function to random numbers having standard uniform distribution in the interval [0,1].

After you find the inverted function, you get 1000 numbers distributed according to the needed distribution this obvious way:

[inverted_function(random.random()) for x in range(1000)]

More on Inverse Transform Sampling:

Also, there is a good question on StackOverflow related to the topic:

Community
  • 1
  • 1
Igor Chubin
  • 61,765
  • 13
  • 122
  • 144
  • 1
    I implemented a function doing the inverse transform sampling with a help of SymPy and asked for a review on a Code Review Stack Exchange: [Link](https://codereview.stackexchange.com/questions/196286/inverse-transform-sampling). Maybe someone can find it helpful. – Georgy Jun 11 '18 at 15:59
  • 7
    Many functions that one may wish to draw random samples from are not analytically invertible. – Mead Feb 01 '19 at 05:27
8

This code implements the sampling of n-d discrete probability distributions. By setting a flag on the object, it can also be made to be used as a piecewise constant probability distribution, which can then be used to approximate arbitrary pdf's. Well, arbitrary pdfs with compact support; if you efficiently want to sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like airy-point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy; the many small values in the tails should contribute substantially, but they will get drowned out in fp accuracy without sorting.

class Distribution(object):
    """
    draws samples from a one dimensional probability distribution,
    by means of inversion of a discrete inverstion of a cumulative density function

    the pdf can be sorted first to prevent numerical error in the cumulative sum
    this is set as default; for big density functions with high contrast,
    it is absolutely necessary, and for small density functions,
    the overhead is minimal

    a call to this distibution object returns indices into density array
    """
    def __init__(self, pdf, sort = True, interpolation = True, transform = lambda x: x):
        self.shape          = pdf.shape
        self.pdf            = pdf.ravel()
        self.sort           = sort
        self.interpolation  = interpolation
        self.transform      = transform

        #a pdf can not be negative
        assert(np.all(pdf>=0))

        #sort the pdf by magnitude
        if self.sort:
            self.sortindex = np.argsort(self.pdf, axis=None)
            self.pdf = self.pdf[self.sortindex]
        #construct the cumulative distribution function
        self.cdf = np.cumsum(self.pdf)
    @property
    def ndim(self):
        return len(self.shape)
    @property
    def sum(self):
        """cached sum of all pdf values; the pdf need not sum to one, and is imlpicitly normalized"""
        return self.cdf[-1]
    def __call__(self, N):
        """draw """
        #pick numbers which are uniformly random over the cumulative distribution function
        choice = np.random.uniform(high = self.sum, size = N)
        #find the indices corresponding to this point on the CDF
        index = np.searchsorted(self.cdf, choice)
        #if necessary, map the indices back to their original ordering
        if self.sort:
            index = self.sortindex[index]
        #map back to multi-dimensional indexing
        index = np.unravel_index(index, self.shape)
        index = np.vstack(index)
        #is this a discrete or piecewise continuous distribution?
        if self.interpolation:
            index = index + np.random.uniform(size=index.shape)
        return self.transform(index)


if __name__=='__main__':
    shape = 3,3
    pdf = np.ones(shape)
    pdf[1]=0
    dist = Distribution(pdf, transform=lambda i:i-1.5)
    print dist(10)
    import matplotlib.pyplot as pp
    pp.scatter(*dist(1000))
    pp.show()

And as a more real-world relevant example:

x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:,None]*p[None,:]     #2d gaussian
dist = Distribution(pdf, transform=lambda i:i-256)
print dist(1000000).mean(axis=1)    #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
  • Glad I could help. Does approximating the distribution as piecewise continuous suffice for your application? How fast this approach is depends on the resolution you aim for; generating the distribution is N log(N), and sampling has complexity N, with a low time constant. Though I havnt tested it, I could imagine it achieves sufficient accuracy much more efficiently in many scenarios, even where a closed form solution exists. But the main appeal to me is in the flexibility of the approach, permitting arbitrary distributions. – Eelco Hoogendoorn Jan 21 '14 at 22:45
  • Could you please explain the nature of the transform ? I don't quite get what it does! – Sebastiano1991 Apr 09 '19 at 13:42
  • If left unspecified, the return values are identical to the indices of the input array specifying the discrete PDF. The transform simply allows you to remap those values; map them to a range [0..1) or whatever. Frankly it does not really belong in this class for the sake of answering this question; just ended up here from the project I extracted it from. – Eelco Hoogendoorn Apr 09 '19 at 19:55
8

Here is a rather nice way of performing inverse transform sampling with a decorator.

import numpy as np
from scipy.interpolate import interp1d

def inverse_sample_decorator(dist):
    
    def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
        
        x = np.linspace(x_min, x_max, int(n))
        cumulative = np.cumsum(dist(x, **kwargs))
        cumulative -= cumulative.min()
        f = interp1d(cumulative/cumulative.max(), x)
        return f(np.random.random(pnts))
    
    return wrapper

Using this decorator on a Gaussian distribution, for example:

@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)

You can then generate sample points from the distribution by calling the function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other key word arguments that parameterise the distribution.

samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)

Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),

def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5, 
                            **kwargs):
        
    x = np.linspace(x_min, x_max, int(n))
    cumulative = np.cumsum(dist(x, **kwargs))
    cumulative -= cumulative.min()
    f = interp1d(cumulative/cumulative.max(), x)
        
    return f(np.random.random(pnts))
JS00N
  • 91
  • 1
  • 4
  • Using `f(np.random.uniform(low=,high=, size=pnts))` could avoid `ValueError: A value in x_new is below the interpolation range.` – TDHTTT Nov 05 '20 at 06:26
  • @TDHTTT yes that would work, but I should have properly scale cumulative between [0, 1]. I will edit my answer to fix this. – JS00N Nov 06 '20 at 20:30
6

I was in a similar situation but I wanted to sample from a multivariate distribution, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).

def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate))/(target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples

This function requires a function target_density which takes in a data-point and computes its probability.

For details check-out this detailed answer of mine.

Abdul Fatir
  • 6,159
  • 5
  • 31
  • 58
4
import numpy as np
import scipy.interpolate as interpolate

def inverse_transform_sampling(data, n_bins, n_samples):
    hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
    cum_values = np.zeros(bin_edges.shape)
    cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
    inv_cdf = interpolate.interp1d(cum_values, bin_edges)
    r = np.random.rand(n_samples)
    return inv_cdf(r)

So if we give our data sample that has a specific distribution, the inverse_transform_sampling function will return a dataset with exactly the same distribution. Here the advantage is that we can get our own sample size by specifying it in the n_samples variable.

Srivatsan
  • 9,225
  • 13
  • 58
  • 83
  • 10
    Either you are the same person or you probably should have quoted the code's origin [in this blog](http://www.nehalemlabs.net/prototype/blog/2013/12/16/how-to-do-inverse-transformation-sampling-in-scipy-and-numpy/). Which has some nice explanatory graphics as well. – Jost Jan 20 '17 at 14:10
  • The source for this code is: https://tmramalho.github.io/blog/2013/12/16/how-to-do-inverse-transformation-sampling-in-scipy-and-numpy/ – Gabriel Sep 15 '19 at 13:17