
My use case is a bit specific. I want to sample 2 items without replacement from a list/array (of 50, or 100 elements). So I don't have to worry about arrays of sizes of 10^4 or 10^5 or multidimensional data.

I want to know

  1. Which one, numpy.random.choice() or numpy.random.shuffle() is faster for this purpose, and why?
  2. If they both produce random samples of "good quality"? That is, are both generating good random samples for my purpose, or does one produce less random samples? (Just a sanity check to make sure I am not overlooking something regarding the source codes of these functions).

For Question 1, I tried timing both functions (code below), and the shuffle method seems to be about 5-6 times faster. Any insight you can give about this is most welcome. If there are faster ways to achieve my purpose, I will be glad to hear about them (I had looked at the options of the Python random module, but the fastest method from my testing was np.random.shuffle()).

import numpy as np

def shuffler(size, num_samples):
    items = list(range(size))
    np.random.shuffle(items)
    return items[:num_samples]

def chooser(size, num_samples):
    return np.random.choice(size, num_samples, replace=False)

%timeit shuffler(50, 2)
#> 1.84 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit chooser(50, 2)
#> 13 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You may think it's already optimized and I am wasting time trying to save pennies. But np.random.choice() is called 5000000 times in my code and takes about 8% of my runtime. It is being used in a loop to obtain 2 random samples from the population for each iteration. Pseudocode:

for t in range(5000000):
    # Random sample of 2 from the population without replacement.

If there is a smarter implementation for my requirement, I am open to suggestions.

PS: I am aware that shuffle performs an in-place operation, but since I only require the indices of the two random elements, I don't actually have to perform it on my original array. There are other questions that compare the two functions from the Python random module, but I require 2 samples without replacement.
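To make the in-place point concrete: you can shuffle a throwaway list of indices and leave the original array untouched. A minimal sketch (the `data` array here is a made-up stand-in for the real population):

```python
import numpy as np

data = np.arange(100) * 2  # hypothetical stand-in for the original array

# Shuffle a scratch list of indices, not the data itself.
idx = list(range(len(data)))
np.random.shuffle(idx)
i, j = idx[0], idx[1]   # two distinct random indices
pair = data[[i, j]]     # the two sampled elements; data is unchanged
```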

Rithwik

4 Answers


See the source code for numpy.random.choice; with replace=False it creates a 50-item temporary list, shuffles that list, and takes two items from that list.
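In rough pure-NumPy terms, that shuffle-and-slice strategy amounts to something like this sketch (an illustration of the idea, not the actual C implementation):

```python
import numpy as np

def choice_without_replacement_sketch(size, num_samples):
    # Roughly what np.random.choice(size, num_samples, replace=False)
    # does internally: permute the whole population, then slice off
    # the first num_samples entries.
    pool = np.arange(size)   # temporary array holding every candidate
    np.random.shuffle(pool)  # shuffles all `size` items, even for 2 samples
    return pool[:num_samples]

sample = choice_without_replacement_sketch(50, 2)
```

Note that the whole 50-item pool gets built and shuffled even though only two items are kept, which is why the per-call overhead dominates at these sizes.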

Since version 1.17, the implementation decisions for numpy.random.choice and numpy.random.shuffle, as with other numpy.random functions, can't be changed without affecting backward compatibility (see NumPy's RNG policy, NEP 19).

Compare numpy.random.choice with numpy.random.Generator.choice, which is the newer way to sample items in NumPy 1.17 and later. The advantage is that numpy.random.Generator.choice is not subject to the same compatibility guarantee as numpy.random.choice or numpy.random.shuffle. If you care about performance of numpy.random.Generator you can file an issue in NumPy's GitHub repository.

Peter O.
  • `rng = np.random.default_rng(); rng.choice(size, num_samples, replace=False)` is still slower than `shuffle` though, which is . . . sort of baffling – Daniel F Dec 09 '20 at 15:07
  • 1
    The advantage is that `numpy.random.Generator.choice` is not subject to the same compatibility guarantee as `numpy.random.choice` or `numpy.random.shuffle`. If you care about performance of `numpy.random.Generator` you can file an issue in NumPy's GitHub repository. – Peter O. Dec 09 '20 at 15:10
  • Thanks @PeterO. I'll try using the new generator method and compare. Let me go through the resources you shared as well. – Rithwik Dec 09 '20 at 15:31

To answer your questions:

  1. shuffle seems to be the fastest implementation
  2. It should give the same answers (in fact, it seems to be the same thing under the hood)

Let's start with @SvenMarnach's answer here. This is not a dupe of that question, but the answer is useful. Unfortunately, that answer doesn't bring it in line with `shuffler` time-wise:

%timeit shuffler(50, 2)
2.47 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit chooser(50, 2)
52.5 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
rng = np.random.default_rng()  # instantiate the new-style Generator once

def chooser2(size, num_samples):
    return rng.choice(size, num_samples, replace=False)

%timeit chooser2(50, 2)
15.9 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The `random.sample` answer is better:

import random 
def sampler(size, num_samples):
    return np.array(random.sample(range(size), num_samples))

%timeit sampler(50, 2)
4.6 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)    

Still slower, though.

Since I'm not able to parse C code, I'll take Sven's word for it that `np.random.choice` is doing shuffle-and-split in the background, so the methods should be equivalent. Why `shuffler` is so much faster here is baffling to me, though.

EDIT: A sample_indices based on @DaniMesejo's answer (slightly slower for num_samples = 2):

def sample_indices(pop, size, num_samples):
    arr = np.random.rand(pop, size)
    return np.argpartition(arr, num_samples, axis = 1)[:, :num_samples] 

Daniel F
  • Thanks for your suggestions. I have another question now though. `rng = np.random.default_rng(); def chooser2(size, num_samples): return rng.choice(size, num_samples, replace=False)` seems to be doing much better at `#> 16.9 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)`. Any idea why? Should I put the rng declaration inside the loop or outside? Which of these is the "right"/"better" way to do things? I am not used to the new generators in numpy.random – Rithwik Dec 09 '20 at 15:22
  • Try to avoid using words like "right" or "better" on SO, it's a good way to get your question closed as soliciting opinions. `sampler` is faster, and should be as random as `chooser` since it's the same code underlying both. ***Why*** it's faster when the same code is underlying both is the strange thing. – Daniel F Dec 09 '20 at 15:35
  • Sorry, I didn't want to use those words. But anyways, do you know if I should leave the `rng=np.random.default_rng()` option inside or outside the function? It looks like a declaration that needs to be done only once? – Rithwik Dec 09 '20 at 15:47
  • 1
    From what I can see the generator only needs to be instantiated once, yes, so you can put that outside the function. – Daniel F Dec 09 '20 at 15:56

You could use an alternative solution, the idea is to generate a random array and then find the position of the min and max:

import numpy as np


def sample_indices(ran, size):
    arr = np.random.rand(ran, size)
    mi = np.argmin(arr, axis=1).reshape((-1, 1))
    ma = np.argmax(arr, axis=1).reshape((-1, 1))
    return np.hstack((mi, ma))


def shuffler(size, num_samples):
    items = list(range(size))
    np.random.shuffle(items)
    return items[:num_samples]


def chooser(size, num_samples):
    return np.random.choice(size, num_samples, replace=False)


def sample_indices_shuffler(ran, size):
    return np.array([shuffler(size, 2) for _ in range(ran)])


def sample_indices_chooser(ran, size):
    return np.array([chooser(size, 2) for _ in range(ran)])

Here are the timings:

%timeit sample_indices_chooser(1000, 50)
17.3 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sample_indices_shuffler(1000, 50)
2.69 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample_indices(1000, 50)
553 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For:

res = sample_indices(10, 50)
print(res)

Output

[[ 9  6]
 [31 42]
 [17 42]
 [24 45]
 [ 2 49]
 [27 31]
 [21 19]
 [ 7 16]
 [20 28]
 [32 36]]
Dani Mesejo
  • Thanks for the interesting suggestion, although I have some contentions. The type conversion to `np.array()` you've added in `sample_indices_shuffler()` seems unnecessary, since these are just indices I'm dealing with. For generating 1 sample at a time (`ran=1`), even after removing `np.hstack()` the `sample_indices()` function seems to be slower: `sample_indices(1,50) #> 5.69 µs ± 349 ns per loop` compared to `sample_indices_shuffler(1, 50) #> 2.69 µs ± 236 ns per loop`. – Rithwik Dec 09 '20 at 16:02
  • @Rithwik But you want 1 or 50000? The idea behind `sample_indices` is to generate the 50000 pairs not just 1 – Dani Mesejo Dec 09 '20 at 16:03
  • The performance gain you suggest seems to be from generating a lot of samples together. However if I resort to change my algorithm to this implementation instead of one sample at a time, the gain I get is about 7 secs in the total run time of my code. `%timeit sample_indices(5000000,50) #> 2.97 s ± 30.9 ms per loop ` vs `%timeit sample_indices_shuffler(5000000, 50) #>10.4 s ± 133 ms per loop`. Which while isn't negligible, is probably not enough of a factor for me to change my implementation. – Rithwik Dec 09 '20 at 16:08
  • 1
    Thanks for method to to implement a random chooser of size 2 using `argmin` and `argmax`. It's interesting. I will keep it handy for some other use case. – Rithwik Dec 09 '20 at 16:12

numpy is better optimised for large arrays. Rejection sampling via the `random` module in the standard library makes it about twice as fast as the OP's best.

An example that hard-codes `num_samples=2` could be:

from random import randrange

def randrange_two(size):
    v1 = randrange(size)
    v2 = randrange(size)
    while v1 == v2:
        v2 = randrange(size)
    return v1, v2

This runs in ~0.7 µs on my laptop, while `shuffler` runs in 1.7 µs. Note that putting the result into a numpy array slows things down to the same speed as `shuffler`.
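For what it's worth, the same rejection idea generalizes beyond two samples (a hypothetical extension, not benchmarked in the answer); it stays fast as long as `num_samples` is small relative to `size`, since collisions are rare:

```python
from random import randrange

def rejection_sample(size, num_samples):
    # Keep drawing indices until we have num_samples distinct ones.
    # Efficient when num_samples << size, since repeats are unlikely.
    chosen = set()
    while len(chosen) < num_samples:
        chosen.add(randrange(size))
    return tuple(chosen)

pair = rejection_sample(50, 2)
```

One caveat: because the result comes out of a set, the order of the returned indices is not itself randomized, which shouldn't matter when you only care about which two elements were picked.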

Not sure how useful this is, but thought it's worth posting.

Sam Mason
  • Thanks for this method @Sam. I don't need the output to be in a numpy array, so this is good enough. I had tried a similar functional implementation, but I had used `np.random.randint` and that seemed to be significantly worse. However, `randrange` does seem to be a good alternative. Although, on my laptop, the gain is smaller than what you got (for a size of 50). `%timeit randrange_two(50) #> 1.49 µs ± 89.1 ns per loop` `%timeit shuffler(50, 2) #> 1.85 µs ± 39.9 ns per loop` – Rithwik Dec 10 '20 at 05:37
  • 1
    However, for higher sizes `randrange` does seem to scale much better! The `shuffler` seems to scale linearly with size, but `randrange` takes pretty much the same time (and maybe even improves with size). `%timeit randrange_two(500) #> 1.42 µs ± 69.9 ns per loop` `%timeit shuffler(500, 2) #> 16.5 µs ± 240 ns per loop`. So this may be a better solution for me since I may have to work with a size of 50-500. – Rithwik Dec 10 '20 at 05:38
  • @Rithwik in case it matters, I'm using Python 3.8.6 and numpy version 1.19.4 – Sam Mason Dec 11 '20 at 11:22