TL;DR
In general, np.random.choice(replace=False) performs well and usually better than random.sample() (which cannot be used inside Numba-JITted functions in NoPython mode anyway), but for smaller values of k it is best to use random_sample_shuffle_idx(), except when k is very small relative to the input size, in which case random_sample_set() should be used.
Parallel Numba compilation may speed up execution for sufficiently large inputs, but will result in race conditions potentially invalidating the sampling, and it is therefore best avoided.
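To illustrate why, here is a hedged sketch (not part of the code discussed below) of what a naively parallelized index-shuffling variant might look like; the name random_sample_shuffle_idx_par and the use of np.random.randint() are illustrative choices. The point is that concurrent iterations swap overlapping positions of the index array, so the returned "sample" can contain duplicates:
import numpy as np
import numba as nb

# Hypothetical sketch: naive parallelization of the index-shuffling approach.
# The loop iterations are NOT independent (iteration i may swap index[i] with
# any index[j], j >= i), so running them concurrently with nb.prange races on
# the index array and the result may contain repeated picks.
@nb.njit(parallel=True)
def random_sample_shuffle_idx_par(arr, k=-1):
    n = arr.size
    if k < 0:
        k = arr.size
    index = np.arange(n)
    for i in nb.prange(k):  # racy concurrent swaps
        j = np.random.randint(i, n)  # uniform over [i, n)
        index[i], index[j] = index[j], index[i]
    return arr[index[:k]]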
Discussion
A simple Numba-compatible re-implementation of random.sample() can easily be written:
import random
import numba as nb
import numpy as np
@nb.njit
def random_sample_set(arr, k=-1):
    n = arr.size
    if k < 0:
        k = arr.size
    # Numba cannot type an empty set literal, hence the create-then-clear idiom.
    seen = {0}
    seen.clear()
    # Indices are integers regardless of the input's dtype.
    index = np.empty(k, dtype=np.int64)
    for i in range(k):
        # Rejection sampling: draw uniformly over [0, n) until a fresh index is found.
        j = random.randint(0, n - 1)
        while j in seen:
            j = random.randint(0, n - 1)
        seen.add(j)
        index[i] = j
    return arr[index]
This uses a temporary set named seen which stores all previously drawn indices so that none of them is re-used. This should always be faster than random.sample().
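A quick usage sketch (with an arbitrary input array and k, just to show what the function returns):
import numpy as np

arr = np.arange(100)
sample = random_sample_set(arr, 5)
print(sample)            # 5 elements of arr in random order
print(len(set(sample)))  # 5 -- no element is picked twice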
A potentially much faster version of this can be written by just shuffling (a portion of a copy of) the input:
import random
import numba as nb
@nb.njit
def random_sample_shuffle(arr, k=-1):
    n = arr.size
    if k < 0:
        k = arr.size
    result = arr.copy()
    # Partial Fisher-Yates shuffle: only the first k positions are needed.
    for i in range(k):
        j = random.randint(i, n - 1)
        result[i], result[j] = result[j], result[i]
    return result[:k].copy()
This is faster than random_sample_set() as long as k is sufficiently large. The larger the input array, the larger k needs to be to outperform random_sample_set(): the shuffle always pays for copying the entire input, whereas random_sample_set() only degrades when k approaches n and the collision rate of j in seen becomes high, causing the while-loop to run multiple times on average. Also, random_sample_shuffle() has a memory footprint independent of k (it copies the whole input), which is higher than that of random_sample_set(), whose memory footprint is proportional to k.
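To put a number on the collision argument: with i indices already taken, a uniform draw over [0, n) lands on a fresh index with probability (n - i) / n, so the i-th pick needs about n / (n - i) draws on average. A small back-of-the-envelope sketch (not part of the benchmark code) makes the crossover visible:
def expected_draws(n, k):
    # Expected total number of random.randint() calls made by
    # random_sample_set(): the i-th pick is a geometric trial with
    # success probability (n - i) / n, i.e. n / (n - i) draws on average.
    return sum(n / (n - i) for i in range(k))

n = 1_048_576
print(expected_draws(n, 16))      # ~16: essentially no collisions
print(expected_draws(n, n // 8))  # ~1.07 * k: still cheap
print(expected_draws(n, n - 1))   # ~n * ln(n), i.e. over ten million draws
This is why the shuffle-based versions, which do a fixed amount of work per element, win for large k, while random_sample_set() wins when k is tiny compared to n.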
A slight variation of random_sample_shuffle() is random_sample_shuffle_idx(), which shuffles an array of indices instead of a copy of the input. Its memory footprint is independent of the input's data type (it always allocates an integer index array of size n), which makes it more efficient for larger data types; it is also significantly faster for small values of k, and typically on par with (or slightly slower than) random_sample_shuffle() in the general case:
import random
import numpy as np
import numba as nb
@nb.njit
def random_sample_shuffle_idx(arr, k=-1):
    n = arr.size
    if k < 0:
        k = arr.size
    # Shuffle an integer index array instead of a copy of the (possibly large) items.
    index = np.arange(n)
    for i in range(k):
        j = random.randint(i, n - 1)
        index[i], index[j] = index[j], index[i]
    return arr[index[:k]]
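To make the data-type-independence claim concrete, here is a small illustrative comparison of the temporary storage involved; the sizes assume 64-bit default integers for np.arange():
import numpy as np

n = 1_000_000
bulky = np.zeros(n, dtype=np.complex128)  # 16 MB of items

# random_sample_shuffle() copies the whole input, so its temporary storage
# scales with the item size (16 MB here), while random_sample_shuffle_idx()
# always allocates one integer index array (8 MB with 64-bit indices),
# regardless of the input's dtype.
print(bulky.copy().nbytes)  # 16_000_000 bytes
print(np.arange(n).nbytes)  # 8_000_000 bytes (64-bit default integers)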
When comparing the above with np.random.choice(replace=False):
import numpy as np
import numba as nb
@nb.njit
def random_sample_choice(arr, k=-1):
    if k < 0:
        k = arr.size
    return np.random.choice(arr, k, replace=False)
one would observe that it sits between random_sample_set() and random_sample_shuffle() in terms of speed, as long as the input is sufficiently small and k is not too small.
Of course, np.random.choice() and its newer counterpart np.random.Generator.choice() offer a lot more functionality than these simple implementations.
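For reference, in plain NumPy (outside of Numba) the newer Generator API performs the same without-replacement sampling; the seed below is arbitrary and only there for reproducibility:
import numpy as np

rng = np.random.default_rng(42)                   # arbitrary seed
arr = np.arange(1_000_000)
sample = rng.choice(arr, size=16, replace=False)  # 16 distinct elements of arr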
Benchmarks
Some quick benchmarks can be generated with the following:
funcs = (
    random_sample_set, random_sample_shuffle,
    random_sample_shuffle_idx, random_sample_choice)


def is_good(x):
    # A valid sample without replacement contains no repeated elements.
    return len(x) == len(set(x))


for q in range(4, 24, 4):
    n = 2 ** q
    arr = np.arange(n)
    seq = arr.tolist()
    for k in range(n // 8, n + 1, n // 8):
        print(f"n = {n}, k = {k}")
        func = random.sample
        print(f"{func.__name__:>24s} {is_good(func(seq, k))!s:>5s}", end=" ")
        %timeit -n 1 -r 1 func(seq, k)
        for func in funcs:
            print(f"{func.__name__:>24s} {is_good(func(arr, k))!s:>5s}", end=" ")
            %timeit -n 4 -r 4 func(arr, k)
The most interesting results are:
...
n = 65536, k = 65536
sample True 41 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 22 ms ± 1 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 1 ms ± 113 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 948 µs ± 94 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 918 µs ± 67.7 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
...
n = 1048576, k = 131072
sample True 136 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 19 ms ± 1.84 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 5.85 ms ± 303 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 6.95 ms ± 445 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 26.1 ms ± 1.93 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
...
n = 1048576, k = 917504
sample True 916 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 313 ms ± 47.6 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 29.4 ms ± 1.87 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 32.8 ms ± 1.55 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 28.2 ms ± 1.06 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
...
And the small-k regime (which is the use case of the question):
for q in range(4, 28, 4):
    n = 2 ** q
    arr = np.arange(n)
    seq = arr.tolist()
    k = 16
    print(f"n = {n}, k = {k}")
    func = random.sample
    print(f"{func.__name__:>24s} {is_good(func(seq, k))!s:>5s}", end=" ")
    %timeit -n 1 -r 1 func(seq, k)
    for func in funcs:
        print(f"{func.__name__:>24s} {is_good(func(arr, k))!s:>5s}", end=" ")
        %timeit -n 4 -r 4 func(arr, k)
n = 16, k = 16
sample True 39.1 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 5.11 µs ± 2.95 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 2.62 µs ± 1.67 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 2.51 µs ± 1.47 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 2.39 µs ± 1.47 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
n = 256, k = 16
sample True 43.7 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 3.67 µs ± 2.36 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 2.59 µs ± 1.72 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
2.8 µs ± 2.16 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 5.47 µs ± 1.8 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
n = 4096, k = 16
sample True 33.4 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 3.53 µs ± 1.73 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 4.2 µs ± 1.81 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 3.23 µs ± 1.46 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 51.7 µs ± 4.82 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
n = 65536, k = 16
sample True 58.9 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 4.15 µs ± 2.75 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 35.3 µs ± 7.99 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 15.1 µs ± 5.03 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True The slowest run took 5.93 times longer than the fastest. This could mean that an intermediate result is being cached.
2.3 ms ± 1.61 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
n = 1048576, k = 16
sample True 48.2 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 3.89 µs ± 2.01 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 1.87 ms ± 163 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 810 µs ± 195 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 30.6 ms ± 2.41 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
n = 16777216, k = 16
sample True 70.4 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_sample_set True 5.15 µs ± 3.65 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle True 103 ms ± 1.84 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_shuffle_idx True 75.2 ms ± 3.31 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)
random_sample_choice True 863 ms ± 77.3 ms per loop (mean ± std. dev. of 4 runs, 4 loops each)