30

Given a list of tuples where each tuple consists of a probability and an item, I'd like to sample an item according to its probability. For example, given the list [(.3, 'a'), (.4, 'b'), (.3, 'c')], I'd like to sample 'b' 40% of the time.

What's the canonical way of doing this in python?

I've looked at the random module, which doesn't seem to have an appropriate function, and at numpy.random, which although it has a multinomial function doesn't seem to return the results in a nice form for this problem. I'm basically looking for something like mnrnd in MATLAB.

Many thanks.

Thanks for all the quick answers. To clarify, I'm not looking for explanations of how to write a sampling scheme, but rather to be pointed to an easy way to sample from a multinomial distribution given a set of objects and weights, or to be told that no such function exists in a standard library and so one should write one's own.

ninjagecko
John
  • This has been covered before. Here's my own take: http://stackoverflow.com/questions/3655430/selection-based-on-percentage-weighting/3655542#3655542 – Mark Ransom Jun 21 '11 at 22:47
  • possible duplicate of [A weighted version of random.choice](http://stackoverflow.com/questions/3679694/a-weighted-version-of-random-choice) – Mark Ransom Jun 21 '11 at 22:50
  • numpy.random.multinomial IS mnrnd. More specifically, do you want a list where order is not important, e.g. [a,a,a,b,b,b,b,c,c,c], or a distribution with a pseudo-random order? – Dunes Jun 21 '11 at 23:13
  • This is called the [categorical distribution](http://en.wikipedia.org/wiki/Categorical_distribution), by the way. – Faheem Mitha Dec 02 '11 at 04:40

9 Answers

19

This might do what you want:

numpy.array([.3,.4,.3]).cumsum().searchsorted(numpy.random.sample(5))
JoshAdel
sholte
  • Works for numeric choices, but can be generalized by splitting a dictionary into a probabilities array and a values array, and returning a `sampler` function. Despite the complication, and assuming one returns a sampler so one doesn't have to recompute the cumulative sum, +1 because it is efficient for large arrays due to numpy doing binary search. – ninjagecko Jun 21 '11 at 23:23
  • I wish someone would explain this better. If `[.3,.4,.3]` are the weights, how are we supposed to get the values attached to them? – Daniel Quinn Sep 13 '11 at 17:05
  • @DanielQuinn in this case, let the sampled probabilities be `sampled` and values be `vals=['a','b','c']`. Then, the sampled values are simply `map(lambda x:vals[x], sampled)`. – syockit Feb 19 '15 at 23:20
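Putting the comments together, here is a minimal sketch of the generalization described above (the `make_sampler` name and the index-to-value mapping are illustrative, not part of the original answer); it computes the cumulative sum once and maps the sampled indices back to items:

import numpy

def make_sampler(pairs):
    # pairs: [(probability, item), ...]; split into weights and values
    probs, vals = zip(*pairs)
    cum = numpy.cumsum(probs)  # e.g. [0.3, 0.7, 1.0]
    def sample(n):
        # searchsorted finds, for each uniform draw, which cumulative
        # bucket it lands in; that index selects the matching value
        idx = cum.searchsorted(numpy.random.sample(n))
        return [vals[i] for i in idx]
    return sample

sample = make_sampler([(.3, 'a'), (.4, 'b'), (.3, 'c')])
sample(5)  # e.g. ['b', 'a', 'b', 'c', 'b']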
14

Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:

numpy.random.choice(['a','b','c'], size = 20, p = [0.3,0.4,0.3])
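For what it's worth, on NumPy 1.17+ the same call is also available through the newer Generator API (equivalent output, just the currently recommended interface):

import numpy
rng = numpy.random.default_rng()
rng.choice(['a', 'b', 'c'], size=20, p=[0.3, 0.4, 0.3])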
JP_smasher
11
import numpy

n = 1000
pairs = [(.3, 'a'), (.3, 'b'), (.4, 'c')]
probabilities, items = zip(*pairs)  # unpack once instead of indexing zip(*pairs)
counts = numpy.random.multinomial(n, probabilities)
result = list(zip(counts, items))
# e.g. [(299, 'a'), (299, 'b'), (402, 'c')]
[count * item for count, item in result]
# e.g. ['aaa...a', 'bbb...b', 'ccc...c']  (299 a's, 299 b's, 402 c's)

How exactly would you like to receive the results?

phant0m
  • @John: I have exchanged the `reduce()`-madness for a more readable list comprehension. (I'm not sure whether you get notified if I edit my post now...) – phant0m Jun 22 '11 at 15:41
  • @John: FWIW, IMHO `sholte`'s answer is a much more straightforward one. And it can be extended to handle arbitrary items in a very simple manner (as demonstrated). Thanks – eat Jun 22 '11 at 18:32
  • @eat: You can modify my code to make similar results to sholte's: `numpy.random.multinomial(5, [.3, .3, .4])` - this might return: array([2, 2, 1]). sholte's equivalent result might look like this: array([1, 0, 2, 0, 1]). I don't see how his code would be more straightforward than that. If you care about the order, his result would be more useful, if you don't, mine would be. Anyhow, I have added code to take his input, work it into my code, and bring the result back into a form I thought might be of use to him. – phant0m Jun 22 '11 at 19:46
  • I have updated my answer. Please note that, when commenting, my primary concern was the readability of the code. Anyway, your answer is correct, and after some mental wrestling it reveals the beauty of `multinomial`. Thanks – eat Jun 22 '11 at 22:19
  • Yes, it really isn't readable :) I just stuck to the input as provided by John, which is why it turned out slightly ugly :) - `wrestling` is a good way to put it. Yours does look very clean now. – phant0m Jun 22 '11 at 22:28
3

There are hacks you can do if, for example, your probabilities fit nicely into percentages.

If you're fine with percentages, a lookup-table approach will work (at the cost of a high memory overhead); it's shown at the end of this answer.

But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution after constructing it. This is equivalent to subdividing the unit interval [0,1] into three line segments labelled 'a', 'b', and 'c', then picking a random point on the unit interval and seeing which line segment it falls in.

#!/usr/bin/python3
import random
from collections import Counter

def randomCategory(probDict):
    """
        >>> dist = {'a':.1, 'b':.2, 'c':.3, 'd':.4}

        >>> [randomCategory(dist) for _ in range(5)]
        ['c', 'c', 'a', 'd', 'c']

        >>> Counter(randomCategory(dist) for _ in range(10**5))
        Counter({'d': 40127, 'c': 29975, 'b': 19873, 'a': 10025})
    """
    r = random.random() # range: [0,1)
    total = 0           # range: [0,1]
    for value, prob in probDict.items():
        total += prob
        if total > r:
            return value
    raise Exception('distribution not normalized: {probs}'.format(probs=probDict))

One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not, but just in case, one could insert `if prob == 0: continue`.


For the record, here's the hackish way to do it:

import random

def makeSampler(probDict):
    """
        >>> sampler = makeSampler({'a':0.3, 'b':0.4, 'c':0.3})
        >>> sampler()
        'a'
        >>> sampler()
        'c'
    """
    # int(round(...)) is needed because list replication requires an integer count
    oneHundredElements = sum(([val] * int(round(prob * 100)) for val, prob in probDict.items()), [])
    def sampler():
        return random.choice(oneHundredElements)
    return sampler

However if you don't have resolution issues... this is actually probably the fastest way possible. =)

ninjagecko
  • -1 for the "hackish" way with percentages, but +10 for the cumulative distribution ! – ThibThib Jun 21 '11 at 22:20
  • I have a doubt: `probDict.items()` does not have a defined order; couldn't it return the (k, v) pairs in a different order on different calls, which would lead to an uneven distribution? – phant0m Jun 22 '11 at 10:28
  • @phant0m: This is not an issue because it doesn't matter which order you go in. Any algorithm given here should work for `[('a',0.2),('b',0.8)]` or `[('b',0.8),('a',0.2)]`. The alternative would be to pick a random order and always use that one, by returning the usual `sample()` generator. My previous solution did this, and it uses more memory. There is nothing to gain unless you can take advantage of a strategy to presort them into some weird fractal structure such that performing binary search results in a significant speedup for distributions with many possible values... – ninjagecko Jun 22 '11 at 21:35
  • I'm not sure this is what I meant: If you call randomCategory() for the first time, probDict.items() might return `[('a',0.2),('b',0.8)]`, but if you call it the second time, it might return `[('b',0.8),('a',0.2)]`. An analogy perhaps: Say you have one big bucket (b: 0.8), and a small bucket (a: 0.2). You throw coins into them, always hit one, never miss. If you were to continuously move the buckets (thinking in 1d) - or switch, rather - would this affect the outcome of the experiment? When I think about it now, with the analogy, I'd say no though :) – phant0m Jun 22 '11 at 21:45
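To make that conclusion concrete, here is a small illustrative check (the `sample_from` helper is mine, not from the answer) showing that the iteration order of the pairs doesn't bias the distribution:

import random
from collections import Counter

def sample_from(pairs):
    # walk the cumulative total until it passes a uniform draw
    r = random.random()
    total = 0
    for value, prob in pairs:
        total += prob
        if total > r:
            return value

print(Counter(sample_from([('a', 0.2), ('b', 0.8)]) for _ in range(10**5)))
print(Counter(sample_from([('b', 0.8), ('a', 0.2)]) for _ in range(10**5)))
# both print roughly Counter({'b': 80000, 'a': 20000})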
1

How about creating a list with three "a"s, four "b"s and three "c"s and then just randomly selecting one? Each draw then has exactly the desired probability, and over enough iterations the empirical frequencies will match it. A minimal sketch follows.
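For instance, assuming the weights are exact multiples of 0.1 as in the question:

import random

pool = ['a'] * 3 + ['b'] * 4 + ['c'] * 3            # one entry per 10% of probability
samples = [random.choice(pool) for _ in range(20)]  # draws with replacement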

Fredrik Pihl
1

I reckon the multinomial function is still a fairly easy way to get samples from a distribution in random order. This is just one way:

import numpy

def getSamples(pairs, size):
    # split [(prob, item), ...] into parallel tuples
    probabilities, items = zip(*pairs)
    sampleCounts = numpy.random.multinomial(size, probabilities)
    samples = numpy.array(tuple(countsToSamples(sampleCounts, items)))
    numpy.random.shuffle(samples)
    return samples

def countsToSamples(counts, items):
    # expand each item into `repeats` copies
    for value, repeats in zip(items, counts):
        for _i in range(repeats):
            yield value

Where `pairs` is as specified in the question, [(.3, 'a'), (.4, 'b'), (.3, 'c')], and size is the number of samples you need.
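For example (output is random, so this is only illustrative):

getSamples([(.3, 'a'), (.4, 'b'), (.3, 'c')], 10)
# e.g. array(['b', 'c', 'b', 'a', 'b', 'c', 'b', 'a', 'c', 'b'], dtype='<U1')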

Brian Larsen
Dunes
0

This may be of marginal benefit but I did it this way:

import numpy as np
import scipy.stats as sps

N = 1000
M3 = sps.multinomial.rvs(1, p=[0.3, 0.4, 0.3], size=N, random_state=None)
M3a = [np.where(r == 1)[0][0] for r in M3]  # convert one-hot encoding to integers
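If you then want the items themselves rather than integer codes, mapping through a label list finishes the job (this extra step is mine, not part of the original answer):

items = ['a', 'b', 'c']
samples = [items[i] for i in M3a]  # e.g. ['b', 'a', 'c', 'b', ...]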

This is similar to @eat's answer.

Astrid
0

I'm not sure if this is the pythonic way of doing what you ask, but you could use random.choice(['a','a','a','b','b','b','b','c','c','c']) once per sample you need. (Note that random.sample on that list would draw without replacement, which is a different distribution once you take more than one sample.)
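For what it's worth, since Python 3.6 the standard library also covers this directly with random.choices, which draws with replacement according to the given weights:

import random
random.choices(['a', 'b', 'c'], weights=[0.3, 0.4, 0.3], k=20)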

For a more robust method, bisect the unit interval into sections based on the cumulative probability and draw from the uniform distribution (0,1) using random.random(). In this case the subintervals would be (0, .3), (.3, .7) and (.7, 1). You choose the element based on which subinterval the draw falls into, as sketched below.
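A minimal sketch of that unit-interval method using the bisect module (the make_interval_sampler name is illustrative; the min() guard addresses the floating-point edge case discussed in the comments below):

import bisect
import random

def make_interval_sampler(pairs):
    # pairs: [(probability, item), ...]
    probs, items = zip(*pairs)
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)  # e.g. [0.3, 0.7, 1.0]
    def sample():
        # bisect returns the index of the subinterval the draw falls into;
        # min() guards against float round-off leaving cumulative[-1] < 1
        return items[min(bisect.bisect(cumulative, random.random()), len(items) - 1)]
    return sample

sampler = make_interval_sampler([(.3, 'a'), (.4, 'b'), (.3, 'c')])
[sampler() for _ in range(5)]  # e.g. ['b', 'a', 'b', 'c', 'b']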

Marty B
  • Regarding your description of the unit-interval method, you have to handle the cases where it falls between intervals and if there are intervals of 0 length. – ninjagecko Jun 21 '11 at 23:33
  • The probability of a random number between 0 and 1 lying between the intervals is 0. An interval of 0 length has 0 probability of occurring. – Marty B Jun 21 '11 at 23:44
  • Mathematically, yes. However this is not true with floating-point arithmetic. – ninjagecko Jun 21 '11 at 23:49
  • This will only matter if the endpoints of the intervals are representable by floating point numbers, and if the extra probability of 1/(2^53) matters the op should probably roll his/her own functions. – Marty B Jun 21 '11 at 23:57
0

Just inspired by sholte's very straightforward (and correct) answer: I'll just demonstrate how easy it is to extend it to handle arbitrary items, like:

In []: s= array([.3, .4, .3]).cumsum().searchsorted(sample(54))
In []: c, _= histogram(s, bins= arange(4))
In []: [item* c[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccccccc']

Update:
Based on the feedback of phant0m, it turns out that an even more straightforward solution can be implemented based on multinomial, like:

In []: s= multinomial(54, [.3, .4, .3])
In []: [item* s[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccc']

IMHO here we have a nice summary of empirical-CDF and multinomial-based sampling yielding similar results. So, in summary, pick the one which suits your purposes best.

eat