numpy.random.multinomial at version 1.16.6 is 10x faster than later version

Question

Here are codes and result:

python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4])', number=1000000, globals=globals())))"

numpy version 1.16.6:  1.5 seconds # 10x faster
numpy version 1.18.1: 15.5 seconds
numpy version 1.19.0: 17.4 seconds
numpy version 1.21.4: 15.1 seconds

It is noted that with fixed random seed, the output are the same with different numpy version

python -c "import numpy as np; np.random.seed(0); print(np.__version__); print(np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4], size=10000))" /tmp/tt

Any advice on why numpy version after 1.16.6 is 10x slower?

We have upgraded pandas to latest version 1.3.4, which needs numpy version after 1.16.6

I filled [this issue](https://github.com/numpy/numpy/issues/20636) to get some feedback from the Numpy team and ask if they could improve the situation. — Jérôme Richard, Dec 21 '21 at 16:03

score 3 · Answer 1 · answered Dec 21 '21 at 04:48

TL;DR: this is a local performance regression caused by the overhead of additional checks in the numpy.random.multinomial function. Very small arrays are strongly impacted due to the relative execution time of the required checks.

Under the hood

A binary search on the Git commits of the Numpy code shows that the performance regression appear the first time in mid April 2019. It can be reproduced in the commit dd77ce3cb but not 7e8e19f9a. There are some build issues for the commit in-between, but with some quick fix we can show that the commit 0f3dd0650 is the first to cause the issue. The commit says that it:

Extend multinomial to allow broadcasting
Fix zipf changes missed in NumPy
Enable 0 as valid input for hypergeometric

A deeper analysis of this commit shows that it modifies the multinomial function defined in Cython file mtrand.pyx to perform two additional following checks:

def multinomial(self, np.npy_intp n, object pvals, size=None):
    cdef np.npy_intp d, i, sz, offset
    cdef np.ndarray parr, mnarr
    cdef double *pix
    cdef int64_t *mnix
    cdef int64_t ni

    d = len(pvals)
    parr = <np.ndarray>np.PyArray_FROM_OTF(pvals, np.NPY_DOUBLE, np.NPY_ALIGNED)
    pix = <double*>np.PyArray_DATA(parr)
    check_array_constraint(parr, 'pvals', CONS_BOUNDED_0_1)   # <==========[HERE]
    if kahan_sum(pix, d-1) > (1.0 + 1e-12):
        raise ValueError("sum(pvals[:-1]) > 1.0")

    if size is None:
        shape = (d,)
    else:
        try:
            shape = (operator.index(size), d)
        except:
            shape = tuple(size) + (d,)

    multin = np.zeros(shape, dtype=np.int64)
    mnarr = <np.ndarray>multin
    mnix = <int64_t*>np.PyArray_DATA(mnarr)
    sz = np.PyArray_SIZE(mnarr)
    ni = n
    check_constraint(ni, 'n', CONS_NON_NEGATIVE)              # <==========[HERE]
    offset = 0
    with self.lock, nogil:
        for i in range(sz // d):
            random_multinomial(self._brng, ni, &mnix[offset], pix, d, self._binomial)
            offset += d

    return multin

These two checks are required for the code to be robust. However, they are currently pretty expensive considering their purpose.

Indeed, on my machine, the first check is responsible for ~75% of the overhead and the second for ~20%. The checks takes few micro-seconds but since your input is very small, the overhead is huge compared to the computation time.

One workaround to fix this issue is to write a specific Numba function for this since your input array is very small. On my machine, np.random.multinomial in a trivial Numba function results in good performance.

dankal444 · Answer 2 · 2021-12-21T01:08:57.177

I checked some generators that are under the hood and saw no much change in the timings.

I guessed difference may be due to some overhead, because you are sampling only single value. And it seems to be good hypothesis. When I increased size of the generated random samples to 1000, difference between 1.16.6 and 1.19.2 (my current Numpy version) diminished to ~20%.

python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.
multinomial(1, [0.1, 0.2, 0.3, 0.4], size=1000)', number=10000, globals=globals())))"

numpy version 1.16.6: 1.1 seconds
numpy version 1.19.2: 1.3 seconds

Note that both versions have this overhead, just newer version has it much larger. In both versions it is much faster to sample 1000 values once than sample 1 value 1000 times.

They changed by much the code between 1.16.6 and 1.17.0, see for example this commit, it's hard to analyse. Sorry that can't help you better - I propose to make an issue on Numpy's github.

Version 1.17.0 is where the new API was introduced, which may have something to do with it. Do you know if `np.random.default_rng().multinomial` has the same overhead? — Mad Physicist, Dec 21 '21 at 03:38
@MadPhysicist Yes, this is even much slower. Moving `np.random.default_rng()` to inizialization helps, but still this has same/bit worse performance — dankal444, Dec 21 '21 at 09:21

numpy.random.multinomial at version 1.16.6 is 10x faster than later version

2 Answers2

Under the hood