
I am trying to apply np.random.choice to a big array where each row has different weights, and I am wondering if there is any way to avoid looping and improve the performance. Here, len(weights) could be in the millions.

import numpy as np

weights = [[0.1, 0.5, 0.4],
           [0.2, 0.4, 0.4],
           ...
           [0.3, 0.3, 0.4]]

choice = [1, 2, 3]
ret = np.zeros((len(weights), 20))
for i in range(len(weights)):
    ret[i] = np.random.choice(choice, 20, p=weights[i])
– Nick
  • I am afraid you can't help but loop. Perhaps you could still speed things up by using `numba`? One thing that could help: if you have a lot of duplicate rows in `weights`, you could generate those with a single `np.random.choice()` call and put the results in the right places in `ret` afterwards (a sketch of this idea appears after these comments). Also, you may want to change `for i in range(len(weights)): ...` into `for i, weight in enumerate(weights): ...` (that's not going to buy you speed, it is just good stylistic practice). – norok2 Jul 28 '19 at 03:07
  • @WarrenWeckesser How would they incorporate number of items in the linked one? – Divakar Jul 28 '19 at 06:57
  • @Divakar, well, you know, broadcasting, etc. :) OK, it isn't an exact duplicate. – Warren Weckesser Jul 28 '19 at 07:24
  • What is the shape of `weights`? It may be useful for benchmarking – tstanisl Jul 28 '19 at 07:40
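
A minimal sketch of the deduplication idea from norok2's comment, assuming `weights` has many repeated rows; `choice_with_dedup` is a hypothetical helper, not part of the question:

import numpy as np

def choice_with_dedup(weights, items, num_items):
    # Group identical weight rows, sample once per unique row, and
    # scatter the draws back to the original row positions.
    weights = np.asarray(weights)
    uniq, inverse = np.unique(weights, axis=0, return_inverse=True)
    ret = np.empty((len(weights), num_items), dtype=np.asarray(items).dtype)
    for u, w in enumerate(uniq):
        rows = np.flatnonzero(inverse == u)  # rows sharing this weight vector
        ret[rows] = np.random.choice(items, size=(len(rows), num_items), p=w)
    return ret

This only pays off when the number of unique rows is much smaller than len(weights), since the loop now runs once per unique row rather than once per row.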

3 Answers


Here's a generalization of my answer in Fast random weighted selection across all rows of a stochastic matrix:

import numpy as np

def vectorized_choice(p, n, items=None):
    # Row-wise cumulative sums give each row's CDF.
    s = p.cumsum(axis=1)
    # Uniform samples, shaped to broadcast against the CDFs.
    r = np.random.rand(p.shape[0], n, 1)
    # Inverse-CDF sampling: for each sample, find the first CDF entry
    # that meets or exceeds it; argmax returns the first True along the axis.
    q = np.expand_dims(s, 1) >= r
    k = q.argmax(axis=-1)
    if items is not None:
        k = np.asarray(items)[k]
    return k

p is expected to be a two-dimensional array whose rows are probability vectors. n is the number of samples to draw from the distribution defined by each row. If items is None, the samples are integers in range(0, p.shape[1]). If items is not None, it is expected to be a sequence with length p.shape[1].

Example:

In [258]: p = np.array([[0.1, 0.5, 0.4], [0.75, 0, 0.25], [0, 0, 1], [1/3, 1/3, 1/3]])                                                   

In [259]: p                                                                                                                              
Out[259]: 
array([[0.1       , 0.5       , 0.4       ],
       [0.75      , 0.        , 0.25      ],
       [0.        , 0.        , 1.        ],
       [0.33333333, 0.33333333, 0.33333333]])

In [260]: vectorized_choice(p, 20)                                                                                                       
Out[260]: 
array([[1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 0, 1, 2, 2, 2],
       [0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [1, 0, 2, 2, 0, 1, 2, 1, 0, 0, 0, 0, 2, 2, 0, 0, 2, 1, 1, 2]])

In [261]: vectorized_choice(p, 20, items=[1, 2, 3])                                                                                      
Out[261]: 
array([[2, 1, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2],
       [1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1, 3, 1, 1, 1],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [3, 3, 3, 1, 3, 2, 1, 2, 3, 1, 2, 2, 3, 2, 1, 2, 1, 2, 2, 2]])

Timing for p with shape (1000000, 3):

In [317]: p = np.random.rand(1000000, 3)

In [318]: p /= p.sum(axis=1, keepdims=True)

In [319]: %timeit vectorized_choice(p, 20, items=np.arange(1, p.shape[1]+1))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's the timing for Divakar's function:

In [320]: %timeit random_choice_prob_vectorized(p, 20, choice=np.arange(1, p.shape[1]+1))
7.33 s ± 43.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The difference will be less pronounced if you increase the number of columns in p, and if you make the number of columns big enough, Divakar's function will be faster. E.g.

In [321]: p = np.random.rand(1000, 120)

In [322]: p /= p.sum(axis=1, keepdims=True)

In [323]: %timeit vectorized_choice(p, 20, items=np.arange(1, p.shape[1]+1))
6.41 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [324]: %timeit random_choice_prob_vectorized(p, 20, choice=np.arange(1, p.shape[1]+1))
6.29 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
– Warren Weckesser

Borrowing the idea from Vectorizing numpy.random.choice for given 2D array of probabilities along an axis, along with the idea from vectorized searchsorted, here's one vectorized way -

import numpy as np

def random_choice_prob_vectorized(weights, num_items, choice=None):
    weights = np.asarray(weights)

    # Row-wise CDFs and uniform samples.
    w = weights.cumsum(1)
    r = np.random.rand(len(weights), num_items)

    # Offset each row by its row index so the flattened CDFs form one
    # globally sorted array; offset the samples the same way so a single
    # searchsorted call handles all rows at once.
    m, n = w.shape
    o = np.arange(m)[:, None]
    w_o = (w + o).ravel()
    r_o = (r + o).ravel()
    idx = np.searchsorted(w_o, r_o).reshape(m, -1) % n
    if choice is not None:
        return np.asarray(choice)[idx]
    else:
        return idx

Sample run to verify using 2D bincount -

In [28]: weights = [[0.1, 0.5, 0.4],
    ...:            [0.2, 0.4, 0.4],
    ...:            [0.3, 0.3, 0.4]]
    ...: 
    ...: choice = [1, 2, 3]
    ...: num_items = 20000

In [29]: out = random_choice_prob_vectorized(weights, num_items, choice)

# Use a 2D bincount to get the average per-value occurrences and verify against weights
In [75]: bincount2D_vectorized(out)/num_items
Out[75]: 
array([[0.     , 0.09715, 0.4988 , 0.40405],
       [0.     , 0.1983 , 0.40235, 0.39935],
       [0.     , 0.30025, 0.29485, 0.4049 ]])
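
bincount2D_vectorized is a helper from a separate post and isn't defined here; a minimal sketch of one possible row-wise bincount, assuming the input holds small non-negative integers:

import numpy as np

def bincount2D_vectorized(a):
    # Count value occurrences per row: offset each row's values into a
    # disjoint block so one flat np.bincount covers all rows at once.
    a = np.asarray(a)
    n = a.max() + 1
    offsets = np.arange(a.shape[0])[:, None] * n
    return np.bincount((a + offsets).ravel(), minlength=a.shape[0] * n).reshape(-1, n)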
– Divakar

It looks like each row of the resulting array is independent of the other rows. I am not sure how bad the performance is right now. If it really is a concern, I would try using Python's multiprocessing module to run the random number generation in several processes in parallel. It should help.
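
A minimal sketch of that idea, assuming NumPy >= 1.17 for np.random.default_rng and seeding each worker separately so the processes draw from independent streams; the chunk count and worker function are illustrative, not part of the original answer:

import numpy as np
from multiprocessing import Pool

def sample_chunk(args):
    # Draw 20 samples per row for one chunk of weight rows, using a
    # generator seeded per chunk so the streams don't overlap.
    weights_chunk, seed = args
    rng = np.random.default_rng(seed)
    return np.stack([rng.choice([1, 2, 3], 20, p=w) for w in weights_chunk])

if __name__ == "__main__":
    # Dirichlet rows sum to 1, standing in for the question's weights.
    weights = np.random.dirichlet(np.ones(3), size=100000)
    chunks = np.array_split(weights, 4)
    with Pool(processes=4) as pool:
        parts = pool.map(sample_chunk, [(c, i) for i, c in enumerate(chunks)])
    ret = np.vstack(parts)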

– niuer