I have:
- cats, an array of 10 categories with shape (10,)
- probs, an array of probabilities with shape (10, 50), representing the chance of each category being chosen for 50 different variables
- n_choices, an array with shape (num_sims, 50) containing integers representing the number of categories to choose with replacement for each variable. For example, this could be 0 choices for variable 1, 33 for variable 2 etc
- sims, an array filled with zeros with shape (num_sims, 50, 10), which will later be populated with results
What I am trying to do is as follows:
- For each row in the array (representing one simulation), and each variable in that row, make N choices from 'cats', where N equals the corresponding value in 'n_choices'
- Once the choices are made, add 1 to 'sims' for each time the category was chosen. In other words, I want to allocate out the values from 'n_choices' over the 10 categories based on 'probs', and save the results to 'sims'
Currently I have managed to get this working using loops, as you can see below. This is fine for a small number of sims, but in practice num_sims will be in the thousands, which means my code is far too slow.
def allocate_N(N, var_index):
"""Make N choices from cats for a given variable, and return
the incides of each category
var_index is the position of the variable in n_choices"""
allocation = np.random.choice(cats, size=N, p=probs[:, var_index])
allocation_sorted = np.argsort(cats)
ypos = np.searchsorted(cats[allocation_sorted], allocation)
cat_indices = allocation_sorted[ypos]
return cat_indices
def add_to_sim(sims, cat_indices, var_index):
"""Takes the category indices from allocate_n and adds 1 to
sims at the corresponding location for each occurrence of
the category in cat_indices"""
from collections import Counter
a = Counter(list(cat_indices))
vals = [1*a[j] for j in cat_indices]
pos = [(var_index, x) for x in cat_indices]
sims[tuple(np.transpose(pos))] = vals
# For each variable and each row in sims, make N allocations
# and add results to 'sims'
for var_index in range(len(n_choices.T)):
sim_count = 0
# slice is (vars x cats), a single row of 'sims'
for slice in sims:
N = n_choices[sim_count, var_index]
if N > 0:
cat_indices = allocate_N(N, var_index)
add_to_sim(slice, cat_indices, var_index)
sim_count += 1
I am sure there must be a way to vectorize this? I was able to make a single random choice for each variable simultaneously using the approach here, but I wasn't sure how to apply that to my particular problem.
Thanks for your help!