
I want to simulate `n` random choices with given probabilities `prob`.

My current solution is the following:

from random import choices

result = [0]*len(prob)
population = list(range(0,len(prob)))
ch = choices(population, weights=prob, k=n)
for i in ch:
    result[i] += 1

My problem is that I call this code a large number of times, usually with large `n`, and this solution does not seem efficient at all.

Is there a better way of doing it (like a pre-built function from some library)?

To summarize, I want the most efficient way of building a random list summing to `n` such that the probability of obtaining a given list equals the probability of obtaining it as `n` random choices with probabilities `prob`.

Thank you

[EDIT to add context]

What I am really doing is `n` random walks in a kind of Markov chain, as follows:

def rand_walk(n, state):
    next_states, prob = complicated_function(state)  # compute the next states and their probabilities
    succ = distribute_over_next_states(n, prob)  # compute how many walks go to which state
    accu = complicated_function_2(state)  # accumulator for the result
    for ns in range(len(next_states)):
        accu += rand_walk(succ[ns], next_states[ns])
    return accu

The point is that computing the next states and their probabilities is costly, so I avoid computing it too many times (i.e., I avoid doing `n` sequential runs). That is why I want to distribute the `n` walks according to the given probabilities.

I hope this is somehow clear enough to understand ...

wece
  • Possible duplicate of [Generating a list of random numbers, summing to 1](https://stackoverflow.com/questions/18659858/generating-a-list-of-random-numbers-summing-to-1) – Mathieu Jul 05 '18 at 12:50
  • my guess would be using numpy. There should be some efficient random implementations. – TomBombadil Jul 05 '18 at 12:50
  • @Mathieu I would like to have integers. I don't think the linked question requires integers. Does it? – wece Jul 05 '18 at 12:52
  • https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.random.html is the only one i know – TomBombadil Jul 05 '18 at 12:53
  • @wece hmm i am sry then. I will search for an answer. Actually you already got one using numpy :D. – TomBombadil Jul 05 '18 at 12:58
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174422/discussion-between-tombombadil-and-wece). – TomBombadil Jul 05 '18 at 12:59
  • How random are the numbers supposed to be? E.g., with your approach, for large `n` each count should end up pretty much exactly at the probability for that position, whereas "true" random numbers should allow wider variation, as long as the sum remains the same. – tobias_k Jul 05 '18 at 13:06
  • @tobias_k They must be as random as n random choices with the given probabilities ... For n large enough it will indeed be close to the probability times n but that's not random enough for me. – wece Jul 05 '18 at 13:19
  • @wece I am not sure I understand what you mean with that. Do you mean that each list that sums to N should be equally likely? – tobias_k Jul 05 '18 at 13:29
  • @wece True, I was a bit fast. Here is the right one: https://stackoverflow.com/questions/3589214/generate-multiple-random-numbers-to-equal-a-value-in-python – Mathieu Jul 05 '18 at 13:32
  • @Mathieu Those are uniform random numbers. The values we are generating are distributed normally about `n/len(prob)`. – FHTMitchell Jul 05 '18 at 13:43
  • @FHTMitchell I edited to explain what I wanted to do. – wece Jul 05 '18 at 13:47
  • @tobias_k I hope it is clearer with the edit – wece Jul 05 '18 at 13:47

2 Answers


Use numpy and collections.Counter? `Counter` is a mapping (a `dict` subclass) that makes more sense for the results you are trying to describe than a list.

import random
import collections
import numpy as np

def f(population, prob, n):
    return collections.Counter(np.random.choice(population, p=prob/np.sum(prob), size=n))

x = list(range(5))

f(x, x, 10000) # --> Counter({1: 986, 2: 1993, 3: 3009, 4: 4012})
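If the rest of your code still expects the positional `result` list from the question rather than a mapping, the `Counter` converts back cheaply. A small sketch (re-stating `f` so it runs standalone); `counts.get(i, 0)` covers indices that were never drawn:

```python
import collections
import numpy as np

def f(population, prob, n):
    # draw all n weighted samples in one vectorised call, then tally them
    return collections.Counter(np.random.choice(population, p=prob/np.sum(prob), size=n))

population = list(range(5))
counts = f(population, population, 10000)

# rebuild the positional list used in the question;
# indices that were never drawn default to 0
result = [counts.get(i, 0) for i in population]
```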

Timings

%timeit f(x, x, 10000)
1.87 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# for comparison
def g(population, prob, n):
    return collections.Counter(random.choices(population, weights=prob, k=n))

%timeit g(x, x, 10000)
4.97 ms ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit your_function_in_op(x, x, 10000)
5.37 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
FHTMitchell
  • 11,793
  • 2
  • 35
  • 47
  • Thank it is indeed faster ! Do you think we could get any improvement from this? – wece Jul 05 '18 at 13:17
  • Probably. You'd have to say in detail what values are in population (are they always integers?) and what you want to use the resultant data for. – FHTMitchell Jul 05 '18 at 13:23
  • The population is always range(0,len(prob)). The result as dict is fine with me. I then call a function with arguments i,res[i]. – wece Jul 05 '18 at 13:27
  • But what is that function you call doing? – FHTMitchell Jul 05 '18 at 13:28

Hmmph. Numpy's already implemented multinomial draws, so we don't even need the via_binomial function:

In [56]: np.random.multinomial(10**8, [0.2, 0.3, 0.5])
Out[56]: array([20003098, 29996630, 50000272])

In [57]: via_binomial(10**8, [0.2, 0.3, 0.5])
Out[57]: [19993527, 30000996, 50005477]

IIUC, you can treat this as a series of repeated binomial draws:

from random import choices
import numpy as np

def original(n, prob):
    result = [0]*len(prob)
    population = list(range(0,len(prob)))
    ch = choices(population, weights=prob, k=n)
    for i in ch:
        result[i] += 1
    return result

def via_binomial(n, prob):
    result = []
    already_handled_prob = 0
    n_left = n
    for p in prob[:-1]:
        draw_prob = p / (1 - already_handled_prob)
        result.append(np.random.binomial(n_left, draw_prob))
        already_handled_prob += p
        n_left -= result[-1]
    result.append(n-sum(result))
    return result

gives me

In [29]: %time original(10**6, [1])
Wall time: 343 ms
Out[29]: [1000000]

In [30]: %time via_binomial(10**6, [1])
Wall time: 0 ns
Out[30]: [1000000]

In [31]: %time original(10**6, [0.25, 0.75])
Wall time: 343 ms
Out[31]: [249944, 750056]

In [32]: %time via_binomial(10**6, [0.25, 0.75])
Wall time: 0 ns
Out[32]: [250030, 749970]

In [33]: %time original(10**8, [0.4, 0.3, 0.2, 0.1])
Wall time: 40.1 s
Out[33]: [40004163, 29999878, 19992540, 10003419]

In [34]: %time via_binomial(10**8, [0.4, 0.3, 0.2, 0.1])
Wall time: 0 ns
Out[34]: [39997530, 29999334, 20003182, 9999954]

In [35]: %time via_binomial(10**8, [0.4, 0.3, 0.2, 0.1])
Wall time: 0 ns
Out[35]: [40009324, 29995955, 19996223, 9998498]

(Okay, I'm kind of cheating by using %time and not %timeit there. :-) On my machine it's about 3-10 µs.)
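As a quick sanity check (my own sketch, not part of the original answer): both `via_binomial` and `np.random.multinomial` return counts that sum exactly to `n`, with each count concentrating near `n * p` for large `n`:

```python
import numpy as np

def via_binomial(n, prob):
    # peel off one category at a time with a binomial draw,
    # renormalising the remaining probability mass as we go
    result = []
    already_handled_prob = 0
    n_left = n
    for p in prob[:-1]:
        draw_prob = p / (1 - already_handled_prob)
        result.append(np.random.binomial(n_left, draw_prob))
        already_handled_prob += p
        n_left -= result[-1]
    result.append(n - sum(result))
    return result

n, prob = 10**6, [0.4, 0.3, 0.2, 0.1]
a = via_binomial(n, prob)           # sequential binomial peeling
b = np.random.multinomial(n, prob)  # numpy's built-in multinomial
```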

DSM
  • 342,061
  • 65
  • 592
  • 494
  • Wow this is great ! exactly what I was looking for. Thanks – wece Jul 05 '18 at 14:17
  • @wece: hold up, it looks like numpy's already implemented the multinomial distribution, so you might not need this at all.. – DSM Jul 05 '18 at 14:18
  • Ha ! yes they did: np.random.multinomial is the function I'm looking for. Thank you for pointing that to me :) – wece Jul 05 '18 at 14:21