3

I have:

  • cats, an array of 10 categories with shape (10,)
  • probs, an array of probabilities with shape (10, 50), representing the chance of each category being chosen for 50 different variables
  • n_choices, an array with shape (num_sims, 50) containing integers representing the number of categories to choose with replacement for each variable. For example, this could be 0 choices for variable 1, 33 for variable 2 etc
  • sims, an array filled with zeros with shape (num_sims, 50, 10), which will later be populated with results

What I am trying to do is as follows:

  • For each row in the array (representing one simulation), and each variable in that row, make N choices from 'cats', where N equals the corresponding value in 'n_choices'
  • Once the choices are made, add 1 to 'sims' for each time the category was chosen. In other words, I want to allocate out the values from 'n_choices' over the 10 categories based on 'probs', and save the results to 'sims'

Currently I have managed to get this working using loops, as you can see below. This is fine for a small number of sims, but in practice num_sims will be in the thousands, which means my code is far too slow.

def allocate_N(N, var_index):
  """Make N choices from cats for a given variable, and return
  the incides of each category
  var_index is the position of the variable in n_choices"""
  allocation = np.random.choice(cats, size=N, p=probs[:, var_index])
  allocation_sorted = np.argsort(cats)
  ypos = np.searchsorted(cats[allocation_sorted], allocation)
  cat_indices = allocation_sorted[ypos]
  return cat_indices

def add_to_sim(sims, cat_indices, var_index):
  """Takes the category indices from allocate_n and adds 1 to
  sims at the corresponding location for each occurrence of
  the category in cat_indices"""
  from collections import Counter
  a = Counter(list(cat_indices))
  vals = [1*a[j] for j in cat_indices]
  pos = [(var_index, x) for x in cat_indices]
  sims[tuple(np.transpose(pos))] = vals

# For each variable and each row in sims, make N allocations
# and add results to 'sims'
for var_index in range(len(n_choices.T)):
  sim_count = 0
  # slice is (vars x cats), a single row of 'sims'
  for slice in sims:      
    N = n_choices[sim_count, var_index]
    if N > 0:
      cat_indices = allocate_N(N, var_index)
      add_to_sim(slice, cat_indices, var_index)
    sim_count += 1

I am sure there must be a way to vectorize this? I was able to make a single random choice for each variable simultaneously using the approach here, but I wasn't sure how to apply that to my particular problem.

Thanks for your help!

Tom Whelan
  • 33
  • 5
  • 1
    Do choices per simulation and variable need to be unique? I think from the question it is inferred they do not, but just to make sure. Also, are the values of `cats` ever actually used in this particular problem, or just the indices of the categories? – jdehesa Aug 08 '19 at 10:09
  • No they don't need to be unique; multiple cats can be chosen for each variable. The values of cats are used when making a choice inside allocate_N (they are strings), but I'm then converting those values to indices anyway, so it might be more efficient to just use the indices from the start! – Tom Whelan Aug 08 '19 at 10:34

1 Answers1

1

What you seem to be describing are samples of a multinomial distribution. You can take samples from the distribution directly. Unfortunately, the parameters of the distribution (number of trials and probabilities) change for each simulation and variable, and neither np.random.multinomial nor scipy.stats.multinomial allow for vectorized sampling with multiple sets of parameters. This means that, if you want to do it like this, you would have to do it with loops still. At least, your code could be simplified to the following:

import numpy as np

np.random.seed(0)
# Problem size
n_cats = 10
n_vars = 50
n_sims = 100
n_maxchoices = 50
# Make example problem
probs = np.random.rand(n_cats, n_vars)
probs /= probs.sum(0)
n_choices = np.random.randint(n_maxchoices, size=(n_sims, n_vars))
sims = np.zeros((n_sims, n_vars, n_cats), np.int32)
# Sample multinomial distribution for each simulation and variable
for i_sim in range(n_sims):
    for i_var in range(n_vars):
        sims[i_sim, i_var] = np.random.multinomial(n_choices[i_sim, i_var],
                                                   probs[:, i_var])
# Check number of choices per simulation and variable is correct
print(np.all(sims.sum(2) == n_choices))
# True

Note you can still make this faster if you are willing to use Numba, with a function like this:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def make_simulations(probs, n_choices, sims):
    for i_sim in nb.prange(n_sims):
        for i_var in nb.prange(n_vars):
            sims[i_sim, i_var] = np.random.multinomial(n_choices[i_sim, i_var],
                                                       probs[:, i_var])

EDIT: A possible alternative solution that does not use multinomial sampling with just one loop could be this:

import numpy as np

np.random.seed(0)
# Problem size
n_cats = 10
n_vars = 50
n_sims = 100
n_maxchoices = 50
# Make example problem
probs = np.random.rand(n_cats, n_vars)
probs /= probs.sum(0)
n_choices = np.random.randint(n_maxchoices, size=(n_sims, n_vars))
sims = np.zeros((n_sims, n_vars, n_cats), np.int32)
# Fill simulations array
n_choices_var = n_choices.sum(0)
sims_r = np.arange(n_sims)
# For each variable
for i_var in range(n_vars):
    # Take choices for all simulations
    choices_var = np.random.choice(n_cats, n_choices_var[i_var], p=probs[:, i_var])
    # Increment choices counts in simulations array
    i_sim = np.repeat(sims_r, n_choices[:, i_var])
    np.add.at(sims, (i_sim, i_var, choices_var), 1)
# Check result
print(np.all(sims.sum(2) == n_choices))
# True

I am not sure if this would actually be faster, since it generates many intermediate arrays. I suppose it depends on the particular parameters of the problem, but I would be surprised if the Numba solution is not the fastest one.

jdehesa
  • 58,456
  • 7
  • 77
  • 121
  • This is excellent, thanks very much for such a detailed, speedy response! Appears to be exactly what I was looking for. – Tom Whelan Aug 08 '19 at 11:36