I have a problem where I need to create m samples of size n from a population, without replacement. Additionally, these samples must retain the original order of the population vector. All of this needs to be super fast.

Population = [50, 30, 12, 24, 420, 243, 173, 194, 123, 43, 21, 64, 34...]

For example, 300 samples of combinations of 3:
[[24, 21, 34], [50, 194, 21], [12, 173, 64], [30, 173, 194].... [12, 243, 34]]

These samples must be independent, and in my case I need to preserve the order of the original population array. There are multiple possible answers, but none of them are fast; they're all bottlenecks in my code. I'm using NumPy for the random number generation.

Some of the most promising approaches are the following:

  1. Use `numpy.random.Generator.choice`, which almost does the job, but is only fast when drawing with replacement, which produces samples with repeated numbers. This is very fast, but then I would need to get rid of the bad samples in a fast way.
    import numpy as np

    gen = np.random.default_rng()

    def random_combination(population, sample, number=3):
        # population must be a NumPy array for the fancy indexing below
        with_replacement_samples = gen.choice(len(population), size=(sample, number))
        pairs = np.sort(with_replacement_samples)
        positions = population[pairs]

        for i in positions:
            # skip rows with repeated values; regenerating each rejected
            # sample one at a time is the expensive part
            if i[0] == i[1] or i[0] == i[2] or i[1] == i[2]:
                continue
            yield i
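One way to make that rejection step cheap is to validate all rows at once with a vectorized mask and only regenerate the rejected ones. This is a sketch of mine, not from the post; the function name and the `np.diff`-based mask are my assumptions:

```python
import numpy as np

gen = np.random.default_rng(0)

def rejection_sample(population, sample, number=3):
    """Draw `sample` index rows with replacement, keep only rows whose
    sorted indices are strictly increasing (i.e. have no repeats), and
    top up in bulk until enough valid rows exist."""
    population = np.asarray(population)
    out = np.empty((0, number), dtype=np.intp)
    while len(out) < sample:
        idx = np.sort(gen.choice(len(population), size=(sample, number)))
        # a sorted row has no duplicates iff every step is strictly positive
        ok = np.all(np.diff(idx, axis=1) > 0, axis=1)
        out = np.concatenate([out, idx[ok]])[:sample]
    # sorted indices mean each sample keeps the population's original order
    return population[out]
```

With a small population the rejection rate grows, but each retry still regenerates a whole batch rather than one row at a time.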

  1. Another way is using this method I found in another answer, but it's very slow:
def random_combination4(posiciones, sample, number=3):
    pair = np.argpartition(gen.random((sample, len(posiciones))), number - 1, axis=-1)[:, :number]
    pair = np.sort(pair)
    for i in posiciones[pair]:
        yield i

  1. The last interesting way is this method from another user, but that solution predates NumPy fixing its performance issues with random number generation:
def random_combination(population, sample, number=3, probabilities=None):
    num_elements = len(population)
    # replicate probabilities as many times as `sample`
    if probabilities is None:
        replicated_probabilities = np.tile(np.full(shape=num_elements, fill_value=1 / num_elements), (sample, 1))
    else:
        replicated_probabilities = np.tile(probabilities, (sample, 1))
    # get random shifting numbers & scale them correctly
    random_shifts = gen.random(replicated_probabilities.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    # shift by numbers & find largest (by finding the smallest of the negatives)
    shifted_probabilities = random_shifts - replicated_probabilities
    combinations = np.sort(np.argpartition(shifted_probabilities, number, axis=1)[:, :number])
    for i in combinations:
        yield population[i]
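For the weighted case, a closely related and better-known formulation is the Gumbel top-k trick: perturb the log-probabilities with Gumbel noise and take the k largest entries per row, which yields a weighted sample without replacement. This sketch and its names are mine, not from the post:

```python
import numpy as np

gen = np.random.default_rng(0)

def gumbel_topk(population, sample, number=3, probabilities=None):
    """Weighted sampling without replacement via the Gumbel top-k trick."""
    population = np.asarray(population)
    n = len(population)
    if probabilities is None:
        logp = np.zeros((sample, n))  # uniform weights
    else:
        logp = np.tile(np.log(probabilities), (sample, 1))
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(gen.random((sample, n))))
    # the `number` largest perturbed entries per row form one sample
    idx = np.argpartition(-(logp + gumbel), number - 1, axis=1)[:, :number]
    idx = np.sort(idx, axis=1)  # restore the population's original order
    return population[idx]
```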


  1. The last way would be to use a plain Python loop, but that's really expensive:

def random_combination2(population, sample, number=3):
    for i in range(sample):
        pair = np.sort(gen.choice(len(population), size=number, replace=False))
        yield population[pair[0]], population[pair[1]], population[pair[2]]
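Since the question already uses the new `Generator` API, one more fully vectorized option (a sketch of mine; the function name is an assumption) is `Generator.permuted`, which shuffles every row of an index matrix independently in one call, so the first `number` columns of each row are a without-replacement draw:

```python
import numpy as np

gen = np.random.default_rng(0)

def random_combination_permuted(population, sample, number=3):
    """One without-replacement draw per row via independent row shuffles."""
    population = np.asarray(population)
    # one full index row per requested sample
    idx = np.tile(np.arange(len(population)), (sample, 1))
    # shuffle each row independently, keep the first `number` columns
    idx = gen.permuted(idx, axis=1)[:, :number]
    idx.sort(axis=1)  # preserve the population's original order
    return population[idx]
```

This avoids rejection entirely, at the cost of permuting full rows of length `len(population)`, which stays cheap for populations of a few hundred elements.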
Nicolas R
  • Note that `numpy.random.choice(..., replace=False)` is well-known to be slow; however, note that due to [NumPy's new RNG policy](https://github.com/numpy/numpy/blob/master/doc/neps/nep-0019-rng-policy.rst), its [implementation "can't be fixed without breaking backward compatibility"](https://stackoverflow.com/questions/62309424/does-numpy-random-seed-always-give-the-same-random-number-every-time/62310046#62310046). Thus you will have to use a different approach than `numpy.random.choice(..., replace=False)` if you care about performance. – Peter O. Oct 03 '20 at 19:12
  • Can you please specify approximate values for `m`, `n` and for the size of the population? – Stef Oct 03 '20 at 20:00
  • Can you just shuffle the data? – Mad Physicist Oct 03 '20 at 21:17
  • @Stef m will vary between 500 and 4000, the population between 20 and 500 and n will always be 3 in my case. – Nicolas R Oct 04 '20 at 14:54

1 Answer

Not sure if this is faster than your version, but maybe you can give it a try:

from random import shuffle
import numpy as np
#import pandas as pd  # activate this line, if you want to use pandas

# Just create a fake population with randn;
# you can skip this line and set population
# to your data. Just convert it to a numpy array
# in case it is not stored in one.
# In case it consists of columns with different types,
# you can also use a pandas approach, which only differs in three lines.
population = np.random.randn(10000, 1)
#population = pd.DataFrame(population)  # activate this line if you want to try out pandas

indexes = list(range(population.shape[0]))
shuffle(indexes)

m = 20   # m samples
n = 200  # of size n each (this approach needs m * n <= population size)
samples = list()
for i in range(m):
    sample_indexes = indexes[:n]
    sample_indexes.sort()  # sorting restores the population's original order
    indexes = indexes[n:]
    samples.append(population[sample_indexes, :])
    #samples.append(population.iloc[sample_indexes, :])  # use this instead of the line above for pandas; it assumes a 0-based index without gaps
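The same idea wrapped in a function (the name, the `assert`, and the `rng` parameter are mine, not from the answer). Note that because every sample is sliced from a single shuffle, the samples are disjoint rather than independent draws, so it only works while `m * n <= len(population)`:

```python
import numpy as np

def disjoint_samples(population, m, n, rng=None):
    """Shuffle the indices once, then slice m disjoint samples of size n,
    sorting each slice so the population's original order is preserved."""
    population = np.asarray(population)
    assert m * n <= len(population), "not enough elements for disjoint samples"
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(population))
    return [population[np.sort(idx[i * n:(i + 1) * n])] for i in range(m)]
```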
jottbe