233

I have a file with some probabilities for different values e.g.:

1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2

I would like to generate random numbers using this distribution. Is there an existing module that handles this? It's fairly simple to code on your own (build the cumulative distribution function, generate a random value in [0,1] and pick the corresponding value), but this seems like a common problem and someone has probably already created a function/module for it.

I need this because I want to generate a list of birthdays (which do not follow any distribution in the standard random module).

pafcu
  • 2
    Other than `random.choice()`? You build the master list with the proper number of occurrences and choose one. This is a duplicate question, of course. – S.Lott Nov 24 '10 at 11:03
  • 1
    possible duplicate of [Random weighted choice](http://stackoverflow.com/questions/2073235/random-weighted-choice) – S.Lott Nov 24 '10 at 11:03
  • 2
    @S.Lott isn't that very memory intensive for big differences in the distribution? – Lucas Moeskops Nov 24 '10 at 11:05
  • 2
    @S.Lott: Your choice method would probably be fine for small numbers of occurrences but I'd rather avoid creating huge lists when it is not necessary. – pafcu Nov 24 '10 at 11:10
  • @S.Lott: The question you linked to is related, but I asked about existing methods while the answers to that question use a "homegrown" CDF approach I already mentioned. – pafcu Nov 24 '10 at 11:11
  • 1
    @Lucasmus: define "big". @pafcu: define "huge". This will work delightfully well until you fill up all of memory with the distribution table. Will you really have billions of choices? Really? For any practical simulation, a few thousand values in a choice table is essentially nothing. – S.Lott Nov 24 '10 at 11:18
  • 8
    @S.Lott: OK, about 10000*365 = 3650000 = 3.6 million elements. I'm not sure about the memory usage in Python, but it's at least 3.6M*4B =14.4MB. Not a huge amount, but not something you should ignore either when there is an equally simple method that does not require the extra memory. – pafcu Nov 24 '10 at 11:25
  • @S.Lott for example one value with chance of occurrence 0.2 and one value with chance of occurrence 0.0037. Requires the list to contain 10000 elements (2000x the first, 37x the second) etc. – Lucas Moeskops Nov 24 '10 at 11:46
  • @Lucasmus: Isn't it actually a question of significant digits? One rate of 0.1234 and another with 0.1235 would require 1234+1235=2469 elements, even though the difference in chances is small. – pafcu Nov 24 '10 at 11:53
  • @pafcu: I think that requires 10000 elements as there are also the elements that cover `1-0.1234-0.1235` – Lucas Moeskops Nov 24 '10 at 11:55
  • @Lucasmus: Sure, I just wanted to point out that it's not a question of how different the probabilities are, just the number of significant digits in them. – pafcu Nov 24 '10 at 11:57
  • @pafcu: Ah yes okay, you're right :) – Lucas Moeskops Nov 24 '10 at 11:59
  • 1
    @S.Lott won't work if probabilities are not rational numbers. What if I have two items, one with probability sqrt(2)/2 and the other with 1 - sqrt(2)/2? If I want to sample from this distribution with 10 decimal places precision in the relative frequencies, I'd have to have a master table with around 10^10 repeated items. There are MUCH more efficient ways of doing that, and they are quite simple. No need of master tables. – Rafael S. Calsaverini Jan 21 '13 at 19:19
  • If `O(log n)` `__getitem__` is ok then you only need `O(n)` memory regardless the precision. Code example: [WeightedPopulation](http://stackoverflow.com/a/13052108/4279) – jfs Mar 22 '15 at 15:49

13 Answers

207

scipy.stats.rv_discrete might be what you want. You can supply your probabilities via the values parameter. You can then use the rvs() method of the distribution object to generate random numbers.
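
For the distribution in the question, a minimal sketch might look like this (the distribution name 'custm' is arbitrary):

import scipy.stats

xk = (1, 2, 3, 4, 5, 6)                # values
pk = (0.1, 0.05, 0.05, 0.2, 0.4, 0.2)  # probabilities
custm = scipy.stats.rv_discrete(name='custm', values=(xk, pk))
print(custm.rvs(size=10))              # ten samples drawn with the given weights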

As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.

numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])

If you are using Python 3.6 or above, you can use random.choices() from the standard library – see the answer by Mark Dickinson.

Mark Ransom
Sven Marnach
  • 18
    On my machine `numpy.random.choice()` is almost 20 times faster. – Eugene Pakhomov Jun 18 '16 at 06:26
  • @EugenePakhomov I don't quite understand your comment. So a function doing something completely different is faster than the one I suggested. My recommendation would still be to use the function that does what you want rather than a function that does something else, even if the function that does something else is faster. – Sven Marnach Jun 19 '16 at 10:58
  • 11
    It does exactly the same w.r.t. the original question. E.g.: `numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])` – Eugene Pakhomov Jun 20 '16 at 12:17
  • 2
    Surprisingly, rv_discrete.rvs() works in O(len(p) * size) time and memory, while choice() seems to run in optimal O(len(p) + log(len(p)) * size) time. – alyaxey Oct 09 '17 at 16:16
  • 3
    If you're using **Python 3.6** or newer there's [**another answer**](https://stackoverflow.com/a/41852266/5987) that doesn't require any addon packages. – Mark Ransom Apr 04 '18 at 18:47
188

Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices.

Example usage: let's set up a population and weights matching those in the OP's question:

>>> from random import choices
>>> population = [1, 2, 3, 4, 5, 6]
>>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

Now choices(population, weights) generates a single sample, contained in a list of length 1:

>>> choices(population, weights)
[4]

The optional keyword-only argument k allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter to check that the distribution we get roughly matches the weights we gave.

>>> million_samples = choices(population, weights, k=10**6)
>>> from collections import Counter
>>> Counter(million_samples)
Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
Mark Dickinson
33

An advantage of generating the list using the CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can get k numbers in O(k log n). Since normal Python lists are inefficient, you can use the array module.
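
A minimal sketch of the binary-search variant (my example, using the standard bisect module and the distribution from the question):

import bisect
import random

values = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# O(n) preprocessing: cumulative sums of the probabilities
cdf = []
total = 0.0
for p in probs:
    total += p
    cdf.append(total)

def sample():
    # O(log n) per sample: find the first cumulative value exceeding a uniform draw
    i = bisect.bisect_right(cdf, random.random() * total)
    return values[min(i, len(values) - 1)]  # min() guards against float round-off

print([sample() for _ in range(10)])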

If you insist on constant space, you can do the following; O(n) time, O(1) space.

import random

def random_distr(l):
    r = random.uniform(0, 1)
    s = 0
    for item, prob in l:
        s += prob
        if s >= r:
            return item
    return item  # Might occur because of floating point inaccuracies
sdcvvc
18

Maybe it is kind of late, but you can use numpy.random.choice(), passing the p parameter:

val = numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
Engineero
Heberto Mayorquin
  • 1
    The OP doesn't want to use `random.choice()` - see the comments. – pobrelkey Dec 01 '13 at 01:17
  • 6
    `numpy.random.choice()` is completely different from `random.choice()` and supports probability distribution. – Eugene Pakhomov Jun 18 '16 at 06:25
  • Can't I use a function to define p? Why would I want to define it with numbers? – rjurney Feb 17 '21 at 21:17
  • If you want to sample from a specific distribution you should use a statistical package like `scipy.stats` or `statsmodels` and then get samples from the specific probability distribution you want to sample from. This question concerns the case of a user-defined discrete distribution. – Heberto Mayorquin Apr 13 '22 at 06:15
18

(OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)

import random

pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
cdf = [(i, sum(p for j,p in pdf if j < i)) for i,_ in pdf]
R = max(i for r in [random.random()] for i,c in cdf if c <= r)

I pseudo-confirmed that this works by eyeballing the output of this expression:

sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
       for _ in range(1000))
Marcelo Cantos
  • 1
    This looks impressive. Just to put things in context, here are the results from 3 consecutive executions of the above code: ['Count of 1 with prob: 0.1 is: 113', 'Count of 2 with prob: 0.05 is: 55', 'Count of 3 with prob: 0.05 is: 50', 'Count of 4 with prob: 0.2 is: 201', 'Count of 5 with prob: 0.4 is: 388', 'Count of 6 with prob: 0.2 is: 193']..............['Count of 1 with prob: 0.1 is: 77', 'Count of 2 with prob: 0.05 is: 60', 'Count of 3 with prob: 0.05 is: 51', 'Count of 4 with prob: 0.2 is: 193', 'Count of 5 with prob: 0.4 is: 438', 'Count of 6 with prob: 0.2 is: 181'] ............. and – Vaibhav Dec 29 '15 at 10:10
  • ['Count of 1 with prob: 0.1 is: 84', 'Count of 2 with prob: 0.05 is: 52', 'Count of 3 with prob: 0.05 is: 53', 'Count of 4 with prob: 0.2 is: 210', 'Count of 5 with prob: 0.4 is: 405', 'Count of 6 with prob: 0.2 is: 196'] – Vaibhav Dec 29 '15 at 10:11
  • A question, how do I return max(i... , if 'i' is an object? – Vaibhav Dec 29 '15 at 11:33
  • @Vaibhav `i` isn't an object. – Marcelo Cantos Dec 31 '15 at 01:34
9

I wrote a solution for drawing random samples from a custom continuous distribution.

I needed this for a similar use-case to yours (i.e. generating random dates with a given probability distribution).

You just need the function random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.

import numpy as np

#function
def random_custDist(x0,x1,custDist,size=None, nControl=10**6):
    #generate a list of size random samples, obeying the distribution custDist
    #suggests random samples between x0 and x1 and accepts the suggestion with probability custDist(x)
    #custDist does not need to be normalized. Add this condition to increase performance.
    #Best performance for max_{x in [x0,x1]} custDist(x) = 1
    samples=[]
    nLoop=0
    while len(samples)<size and nLoop<nControl:
        x=np.random.uniform(low=x0,high=x1)
        prop=custDist(x)
        assert prop>=0 and prop<=1
        if np.random.uniform(low=0,high=1) <=prop:
            samples += [x]
        nLoop+=1
    return samples

#call
x0=2007
x1=2019
def custDist(x):
    if x<2010:
        return .3
    else:
        return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
samples=random_custDist(x0,x1,custDist=custDist,size=1000)
print(samples)

#plot
import matplotlib.pyplot as plt
#hist
bins=np.linspace(x0,x1,int(x1-x0+1))
hist=np.histogram(samples, bins )[0]
hist=hist/np.sum(hist)
plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
#dist
grid=np.linspace(x0,x1,100)
discCustDist=np.array([custDist(x) for x in grid]) #discrete version
discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
plt.plot(grid,discCustDist,label='custom distribution (custDist)', color='C1', linewidth=4)
#decoration
plt.legend(loc=3,bbox_to_anchor=(1,0))
plt.show()

Continuous custom distribution and discrete sample distribution

The performance of this solution is improvable for sure, but I prefer readability.
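
As one illustration of a possible speed-up (my own sketch, not part of the answer), the rejection step can be vectorized with NumPy; np.vectorize is used only so that the scalar custDist above can be applied to whole arrays:

def random_custDist_vec(x0, x1, custDist, size, batch=10**5):
    f = np.vectorize(custDist)
    samples = np.empty(0)
    while samples.size < size:
        x = np.random.uniform(low=x0, high=x1, size=batch)   # candidate points
        accept = np.random.uniform(size=batch) <= f(x)       # accept/reject
        samples = np.concatenate([samples, x[accept]])
    return samples[:size]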

Markus Dutschke
2

Make a list of items, based on their weights:

items = [1, 2, 3, 4, 5, 6]
probabilities= [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
# if the list of probs is normalized (sum(probs) == 1), omit this part
prob = sum(probabilities) # find sum of probs, to normalize them
c = (1.0)/prob # a multiplier to make a list of normalized probs
probabilities = map(lambda x: c*x, probabilities)
print probabilities

ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
ml = len(str(ml)) - str(ml).find('.') -1
amounts = [ int(x*(10**ml)) for x in probabilities]
itemsList = list()
for i in range(0, len(items)): # iterate through original items
  itemsList += items[i:i+1]*amounts[i]

# choose from itemsList randomly
print itemsList

An optimization may be to normalize amounts by the greatest common divisor, to make the target list smaller.
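
For example, a sketch assuming Python 3 (math.gcd; on Python 2 fractions.gcd would be used instead), reusing the amounts list from above:

from functools import reduce
from math import gcd

g = reduce(gcd, amounts)             # e.g. gcd of [10, 5, 5, 20, 40, 20] is 5
amounts = [a // g for a in amounts]  # -> [2, 1, 1, 4, 8, 4], a 5x smaller target list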

Also, this might be interesting.

Community
khachik
  • If the list of items is large this might use a lot of extra memory. – pafcu Nov 24 '10 at 11:39
  • @pafcu Agreed. Just a solution, the second which came to my mind (the first one was to search for something like "weight probability python" :) ). – khachik Nov 24 '10 at 11:46
1
from __future__ import division
import random
from collections import Counter


def num_gen(num_probs):
    # calculate minimum probability to normalize
    min_prob = min(prob for num, prob in num_probs)
    lst = []
    for num, prob in num_probs:
        # keep appending num to lst, proportional to its probability in the distribution
        for _ in range(int(prob/min_prob)):
            lst.append(num)
    # all elems in lst occur proportional to their distribution probabilities
    while True:
        # pick a random index from lst
        ind = random.randint(0, len(lst)-1)
        yield lst[ind]

Verification:

gen = num_gen([(1, 0.1),
               (2, 0.05),
               (3, 0.05),
               (4, 0.2),
               (5, 0.4),
               (6, 0.2)])
lst = []
times = 10000
for _ in range(times):
    lst.append(next(gen))
# Verify the created distribution:
for item, count in Counter(lst).iteritems():
    print '%d has %f probability' % (item, count/times)

1 has 0.099737 probability
2 has 0.050022 probability
3 has 0.049996 probability 
4 has 0.200154 probability
5 has 0.399791 probability
6 has 0.200300 probability
Saksham Varma
1

Based on other solutions, you generate a cumulative distribution (as integers or floats, whatever you like), then you can use bisect to make it fast.

This is a simple example (I used integers here):

import bisect
import random

l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]

def get_cdf(l):
    ret=[]
    c=0
    for i in l:
        c+=i[0]
        ret.append((c, i[1]))
    return ret

def get_random_item(cdf):
    # randint(1, total) rather than randint(0, total), so the first item
    # does not get one extra winning value
    return cdf[bisect.bisect_left(cdf, (random.randint(1, cdf[-1][0]),))][1]

cdf=get_cdf(l)
for i in range(100): print get_random_item(cdf),

The get_cdf function converts the weights 20, 60, 10, 10 into the cumulative values 20, 20+60, 20+60+10, 20+60+10+10.

Now we pick a random number up to 20+60+10+10 using random.randint, then use bisect to find the corresponding value quickly.

Muayyad Alsadi
1

Another answer, probably faster :)

import random

distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]

# init distribution
dlist = []
sumchance = 0
for value, chance in distribution:
    sumchance += chance
    dlist.append((value, sumchance))
assert sumchance == 1.0  # not a good assert because of float equality

# get random value
def get_random_value(dlist):
    r = random.random()
    # for small distributions use linear search
    if len(dlist) < 64:  # don't know the exact speed limit
        for value, sumchance in dlist:
            if r < sumchance:
                return value
    # else: binary search (not implemented; see the sketch below)
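
A possible shape for the missing binary-search branch (my sketch, not part of the original answer), using the standard bisect module on the cumulative values in dlist:

import bisect

def get_random_value_bisect(dlist):
    r = random.random()
    cumulative = [sumchance for value, sumchance in dlist]
    i = bisect.bisect_right(cumulative, r)   # first index with cumulative > r
    return dlist[min(i, len(dlist) - 1)][0]  # min() guards against float round-off
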
Lucas Moeskops
0

None of these answers is particularly clear or simple.

Here is a clear, simple method that is guaranteed to work.

accumulate_normalize_values takes a dictionary p that maps symbols to probabilities OR frequencies. It outputs a usable list of tuples from which to do selection.

def accumulate_normalize_values(p):
        pi = p.items() if isinstance(p,dict) else p
        accum_pi = []
        accum = 0
        for i in pi:
                accum_pi.append((i[0],i[1]+accum))
                accum += i[1]
        if accum == 0:
                raise Exception( "You are about to explode the universe. Continue ? Y/N " )
        normed_a = []
        for a in accum_pi:
                normed_a.append((a[0],a[1]*1.0/accum))
        return normed_a

Yields:

>>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200  } )
[('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]

Why it works

The accumulation step turns each symbol into an interval between itself and the previous symbol's probability or frequency (or 0 in the case of the first symbol). These intervals can be used to select from (and thus sample the provided distribution) by simply stepping through the list until the random number in the interval 0.0 -> 1.0 (prepared earlier) is less than or equal to the current symbol's interval end-point.

The normalization releases us from the need to make sure everything sums to some value. After normalization the "vector" of probabilities sums to 1.0.

The rest of the code for selection and generating an arbitrarily long sample from the distribution is below:

def select(symbol_intervals,random):
        print symbol_intervals,random
        i = 0
        while random > symbol_intervals[i][1]:
                i += 1
                if i >= len(symbol_intervals):
                        raise Exception( "What did you DO to that poor list?" )
        return symbol_intervals[i][0]


def gen_random(alphabet,length,probabilities=None):
        from random import random
        from itertools import repeat
        if probabilities is None:
                probabilities = dict(zip(alphabet,repeat(1.0)))
        elif len(probabilities) > 0 and isinstance(probabilities[0],(int,long,float)):
                probabilities = dict(zip(alphabet,probabilities)) #ordered
        usable_probabilities = accumulate_normalize_values(probabilities)
        gen = []
        while len(gen) < length:
                gen.append(select(usable_probabilities,random()))
        return gen

Usage:

>>> gen_random (['a','b','c','d'],10,[100,300,400,200])
['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c']   #<--- some of the time
Cris Stringfellow
0

You might want to have a look at NumPy Random sampling distributions.

Manuel Salvadores
  • 4
    The numpy functions also seem to only support a limited number of distributions with no support for specifying your own. – pafcu Nov 24 '10 at 11:20
  • 1
    updated link https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html instead of http://docs.scipy.org/doc/numpy/reference/routines.random.html –  Oct 12 '19 at 15:53
-1

Here is a more effective way of doing this:

Just call the following function with your 'weights' array (assuming the indices are the corresponding items) and the number of samples needed. This function can be easily modified to handle ordered pairs.

Returns indexes (or items) sampled/picked (with replacement) using their respective probabilities:

import random

def resample(weights, n):
    beta = 0

    # Caveat: Assign max weight to max*2 for best results
    max_w = max(weights)*2

    # Pick an item uniformly at random, to start with
    current_item = random.randint(0, n-1)
    result = []

    for i in range(n):
        beta += random.uniform(0, max_w)

        # spin past items until beta "lands" on one
        while weights[current_item] < beta:
            beta -= weights[current_item]
            current_item = (current_item + 1) % n   # cyclic
        result.append(current_item)
    return result

A short note on the concept used in the while loop: beta accumulates increments drawn uniformly at random; we subtract item weights from it and advance the index (cyclically) until we reach an item whose weight exceeds the remaining beta, and that item is selected.

Vaibhav