6

Annoyingly, the following doesn't work:

from collections import Counter
import random

c = Counter([1,1,1,1,0,0])
random.choice(c) # I expect this to return 1 with probability 2/3, 
                 # and 0 with probability 1/3.
                 # It actually returns 4 or 2, with probability 1/2

What is the idiomatic way to sample from a multiset in Python (any version)?

Edit yes, I do really need to use a multiset. My actual data is much bigger and just storing it in a list would not be practical.

Edit 2 I need to do this with a reasonable degree of efficiency, as my code will do this repeatedly. There will be a lot of data stored in the Counter object, and anything that involves copying all of this data into a new data structure is not going to be a viable solution.

N. Virgo
  • 7,970
  • 11
  • 44
  • 65
  • 3
    As always, Eli Bendersky's page on [weighted random selection](http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/) in Python makes for useful reading. – DSM Apr 13 '14 at 05:45

5 Answers5

5

From the docs:

A common task is to make a random.choice() with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats:

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

For your application, you will probably want to use the Counter to build a list of choices and a list of cumulative probabilities, then sample with the second technique.

user2357112
  • 260,549
  • 28
  • 431
  • 505
  • Ok, so the idiomatic solution is just to roll my own code. Now I have to work out how to do this using a Counter instead of an ordered list. (I definitely don't want to create a new ordered list every time I need to sample from the Counter, as the data is very large.) – N. Virgo Apr 13 '14 at 05:36
  • @Nathaniel: Are you going to be updating the Counter frequently? The lists are reusable as long as you don't modify the Counter, and building them is likely to be much more expensive than the binary search step. – user2357112 Apr 13 '14 at 05:48
  • @user2357112 every time I sample from the counter I remove the corresponding item, so unfortunately I would have to re-build the lists every time. – N. Virgo Apr 13 '14 at 05:50
  • @Nathaniel: Are there going to be insertions interspersed with these removals? If not, there are other efficient techniques you can use to avoid rebuilding the lists. – user2357112 Apr 13 '14 at 05:54
  • @user2357112 sadly yes - on average, for every item that's removed, another is inserted. – N. Virgo Apr 13 '14 at 05:55
  • @Nathaniel: Bummer. I believe you can still get efficient insertion and sampling by using a weight-balanced binary tree, but you would probably have to implement your own weight-balanced binary tree for that. Take a look at DSM's link; I believe a version of the algorithm under "Giving up the temporary list" may be the best balance of simplicity and efficiency for your case. – user2357112 Apr 13 '14 at 06:09
4

you can do this with the built in random.choices in python >= 3.6

from collections import Counter
import random

c = Counter([1,1,1,1,0,0])
random.choices(list(c.keys()), weights=list(c.values()), k=1)

note: ordering of dict keys is guaranteed in python >= 3.7 so the example code will run in python >= 3.7. but a similar solution is possible in python 3.6.

lunguini
  • 896
  • 1
  • 9
  • 14
  • 2
    Are we guaranteed that `list(c.keys())` and `list(c.values())` will be in the same order? I had a look through the docs but it was hard to tell. To get around it, one could do `keys, counts = zip(*c.items())` and then `random.choices(keys, counts, k=1)` instead. It always bugs me when I have to build a list only to iterate over it - at the time of writing the question, I wanted to avoid copying all the data into a new structure - but it's great that there's an idiomatic way to do this now. – N. Virgo Oct 15 '20 at 04:14
  • good point, ordering of dict keys [is guaranteed in python >= 3.7](https://stackoverflow.com/a/40007169/3242418), i will edit the answer to note that! – lunguini Oct 15 '20 at 05:12
  • This should be the selected answer for this question. Requires no extra complicated libraries. – Jake Poznanski Jan 27 '21 at 22:53
3

Since this question has got some attention recently, I thought I'd give an answer of my own. Doing this efficiently in Python does seem to involve rolling your own code, but I found an algorithm described on a machine learning blog that's efficient even if the contents of the set keep changing, and can be implemented quite easily. The blog post includes a basic Python implementation and links to a fast Cython implementation.

N. Virgo
  • 7,970
  • 11
  • 44
  • 65
2

I have a similar problem, but the Counter I have is changing repeatedly and the number of elements in the Counter is generally small (no more than 100)

I ended up using the following as a more efficient solution

c = Counter([1,1,1,1,0,0])
random.choice(list(c.elements()))
kon psych
  • 626
  • 1
  • 11
  • 26
0

In all modern Python releases, you can use random.choices() which supports making one or many selections from a population and allows the choices to be weighted.

This example is taken directly from the recipes in the Python docs:

>>> # Six roulette wheel spins (weighted sampling with replacement)
>>> choices(['red', 'black', 'green'], [18, 18, 2], k=6)
['red', 'green', 'black', 'black', 'red', 'black']

Here is the technique applied to a Counter (multiset) to make ten random weighted selections:

>>> from collections import Counter
>>> from random import choices
>>> c = Counter([1,1,1,1,0,0])
>>> choices(population=list(c), weights=c.values(), k=10)
[1, 0, 0, 1, 0, 1, 0, 1, 1, 1]
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485