0

I am trying to speed up a function that randomly samples a number of records with the possible combinations of a number of categories for a number of records and ensures they are unique (i.e. let's assume there's 3 records, any of them can be either 0 or 1 and I want 10 random samples of unique possible combinations of records).

If I did not use numba, I might would do something like this:

import numpy as np

def myfunc(categories, NumberOfRecords, maxsamples):
  return np.unique( np.random.choice(np.arange(categories), size=(maxsamples*10, NumberOfRecords), replace=True), axis=0 )[0:maxsamples]

Annoyingly, numba does not support axis in np.unique, so I can do something like this, but some of the records may turn out to be non-unique.

from numba import njit, int64
import numpy as np

@njit(int64[:,:](int64, int64, int64), cache=True)
def myfunc(categories, NumberOfRecords, maxsamples):
  return np.random.choice(np.arange(categories), size=(maxsamples, NumberOfRecords), replace=True) 

myfunc(categories=2, NumberOfRecords=3, maxsamples=10)

E.g. in one call (obviously there's some randomness here), I got the below (for which the indices 1 and 6, and 3 and 4, and 7 and 9 are identical rows):

array([[0, 1, 1],
       [1, 1, 0],
       [0, 1, 0],
       [1, 0, 1],
       [1, 0, 1],
       [1, 1, 1],
       [1, 1, 0],
       [1, 0, 0],
       [0, 0, 0],
       [1, 0, 0]])

My questions are:

  1. Is this something where I would even expect a speed up from numba?
  2. If so, how can I get a unique rows (this seems rather difficult with numba, but presumably there's a way)?
  3. Perhaps there's a way to get at this more efficiently (perhaps without creating more random samples than I need in the end)?
Björn
  • 644
  • 10
  • 23

1 Answers1

-1

In the following, I don't use numba, but all the operations use vectorized numpy functions.

Each row of the result that you generate can be interpreted as an integer expressed in base N, where N is the number of categories. With that interpretation, what you want is to sample without replacement from the integers [0, 1, ... N**R-1], where R is the number of "records". You can use the choice function for that, with the argument replace=False. Once you have that, you need to convert the chosen integers to base N. For that, I use the function int2base, which is a pared down version of a function that I wrote in a different answer.

Here's the code:

import numpy as np


def int2base(x, base, ndigits):
    # x = np.asarray(x)  # Uncomment this line for general purpose use.
    powers = base ** np.arange(ndigits)
    digits = (x.reshape(x.shape + (1,)) // powers) % base
    return digits


def makesample(ncategories, nrecords, nsamples, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = ncategories ** nrecords
    choices = rng.choice(n, replace=False, size=nsamples)
    return int2base(choices, ncategories, nrecords)

In makesample, I included the optional argument rng. It allows you to specify the object that holds the choice function. If not provided, it uses np.random.default_rng().

Example:

In [118]: makesample(2, 3, 6)                                                   
Out[118]: 
array([[0, 1, 1],
       [0, 0, 1],
       [1, 0, 1],
       [0, 0, 0],
       [1, 1, 0],
       [1, 1, 1]])

In [119]: makesample(5, 4, 12)                                                  
Out[119]: 
array([[3, 4, 0, 1],
       [2, 0, 2, 0],
       [4, 2, 4, 3],
       [0, 1, 0, 4],
       [0, 2, 0, 1],
       [1, 2, 0, 1],
       [0, 3, 0, 4],
       [3, 3, 0, 3],
       [3, 4, 1, 4],
       [2, 4, 1, 1],
       [3, 4, 1, 0],
       [1, 1, 4, 4]])

makesample will raise an exception if you ask for too many samples:

In [120]: makesample(2, 3, 10)                                                  
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-120-80044e78a60a> in <module>
----> 1 makesample(2, 3, 10)

~/code_snippets/python/numpy/random_samples_for_so_question.py in makesample(ncategories, nrecords, nsamples, rng)
     17         rng = np.random.default_rng()
     18     n = ncategories ** nrecords
---> 19     choices = rng.choice(n, replace=False, size=nsamples)
     20     return int2base(choices, ncategories, nrecords)

_generator.pyx in numpy.random._generator.Generator.choice()

ValueError: Cannot take a larger sample than population when 'replace=False'
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214