I'm trying to come up with a fast and smart way of generating random vectors from a distribution matrix, much like what is being discussed here: Generate random numbers with a given (numerical) distribution

But the key difference is that I have a distribution matrix, rather than just a single vector.

Now obviously I could just create a for loop and loop over each vector in my matrix, but that doesn't seem very pythonic, or fast, so I'm kinda hoping that there is a better way of doing it.

To give a quick example of what I want to do: Given a one-probability matrix

p = [[0.2, 0.4, 0.4],[0.1, 0.7, 0.2],[0.44, 0.5, 0.06],...]

I wish to draw elements, where each element gets selected with the probability in the probability matrix. (Essentially I want to generate a one-hot encoding from my one-probability matrix). Which could for instance look like this given the above probabilities:

t = [2,1,2,...]
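
For comparison, the row-by-row loop described above might be sketched as follows. This is only an illustrative baseline: the three-row `p` is the visible part of the example matrix, and the use of `np.random.default_rng` with `Generator.choice` is an assumption, not something from the original post.

```python
import numpy as np

rng = np.random.default_rng()

# The three rows shown in the example; each row sums to 1
p = np.array([[0.2, 0.4, 0.4],
              [0.1, 0.7, 0.2],
              [0.44, 0.5, 0.06]])

# One Python-level call per row: simple, but slow for long sequences
t = np.array([rng.choice(p.shape[1], p=row) for row in p])
print(t)  # one index in {0, 1, 2} per row
```

Each call to `choice` handles a single row, so the cost grows with the number of rows at Python speed, which is what motivates a vectorized alternative.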

I need to do this for long sequences, and I need to do it millions of times, but with only one draw per sequence each time. (This is for data augmentation in deep learning.)

Does anyone have a good way of doing this?

Tue

1 Answer

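You could use inverse transform sampling. Compute the cumulative distribution along each row of your p matrix, draw one uniform random number per row, then count how many cumulative values in each row fall below that row's draw; that count is the sampled index. In code:

import numpy as np

p = np.array([[0.2, 0.4, 0.4],[0.1, 0.7, 0.2],[0.44, 0.5, 0.06]])
# One uniform draw per row, shaped (n, 1) so it broadcasts across that row's columns
u = np.random.rand(p.shape[0], 1)
# Count of cumulative entries below u = index of the sampled element
idxs = (p.cumsum(1) < u).sum(1)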

Then idxs will be sampled according to the rows of p. For example, checking the first row with 10,000 draws:

np.histogram((p[0].cumsum() < np.random.rand(10000,1)).sum(1), bins=3)
# array([1977, 4018, 4005]), ... 
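
The same idea can be wrapped in a reusable function. The sketch below is one possible packaging, not the answer's original code: the names `sample_rows` and `to_one_hot` are hypothetical, the `np.random.default_rng` generator is an assumption, and the final clamp is an added guard for rows whose cumulative sum lands slightly below 1 due to floating-point round-off. The key detail is that `u` has shape (n, 1), so each row of the cumulative matrix is compared against its own random number.

```python
import numpy as np

def sample_rows(p, rng=None):
    """Draw one column index per row of p (each row assumed to sum to 1)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    cdf = p.cumsum(axis=1)               # row-wise cumulative distribution
    u = rng.random((p.shape[0], 1))      # one uniform draw per row, as a column
    idxs = (cdf < u).sum(axis=1)         # number of CDF entries below the draw
    # Guard against round-off when the last CDF value is slightly below 1
    return np.minimum(idxs, p.shape[1] - 1)

def to_one_hot(idxs, n_classes):
    """Turn sampled indices into one-hot rows."""
    return np.eye(n_classes, dtype=int)[idxs]

p = np.array([[0.2, 0.4, 0.4],
              [0.1, 0.7, 0.2],
              [0.44, 0.5, 0.06]])
idxs = sample_rows(p)
print(idxs)                    # one index in {0, 1, 2} per row
print(to_one_hot(idxs, p.shape[1]))
```

All the work happens in a handful of array operations, so a long sequence costs one cumulative sum, one batch of uniform draws, and one comparison, with no Python-level loop over rows.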
kib
  • This seems to be the same strategy posted in mathfux link, and it seems like a good solution. Thanks! – Tue Sep 18 '20 at 03:37