
I am implementing an MCMC procedure in which the most time-consuming part is evaluating the logarithm of the negative binomial probability mass function (with matrices as arguments). The likelihood is computed in every iteration of the procedure for new parameter values.

I wrote my own function, which is faster than the built-in scipy nbinom.logpmf.

import numpy as np
import scipy.special as sc
from scipy.stats import nbinom

def my_logpmf_nb(x, n, p):
    """ logaritm of the negative binomial probability mass function """
    coeff = sc.gammaln(n+x) - sc.gammaln(x+1) - sc.gammaln(n)
    return coeff + n*np.log(p) + sc.xlog1py(x, -p)

N = 20
M = 8
p = np.random.uniform(0,1,(N,M))
r = np.abs(np.random.normal(10,10, (N,M)))
matrix = np.random.negative_binomial(r,p)

%timeit -n 1000 my_logpmf_nb(matrix, r, p)
16.4 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 nbinom.logpmf(matrix, r, p)
62.7 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I tried to optimize further with the use of Cython, but I failed completely (the function implemented in Cython is much slower).

from libc.math cimport log, log1p
import numpy as np
cimport cython
cdef:
    float EULER_MAS = 0.577215664901532 # Euler-Mascheroni constant

@cython.cdivision(True)
def gammaln(float z, int n=100000):
    """Compute log of gamma function for some real positive float z"""
    """From: https://stackoverflow.com/questions/54850985/fast-algorithm-for-log-gamma-function"""

    cdef:
        float out = -EULER_MAS*z - log(z)
        int k
        float t
    for k in range(1, n):
        t = z / k
        out += t - log1p(t)

    return out

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def matrix_nb(double[:,:] x, double[:,:] nn, double[:,:] p):
    m = x.shape[0]
    n = x.shape[1]
    res = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            res[i, j] = (gammaln(nn[i, j] + x[i, j])
                         - gammaln(x[i, j] + 1)
                         - gammaln(nn[i, j])
                         + nn[i, j] * log(p[i, j])
                         + x[i, j] * log1p(-p[i, j]))

    return res

matrix_bis = matrix.astype("float")
%timeit -n 1000 matrix_nb(matrix_bis, r, p)
49.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way? Or maybe numba would be useful?

  • Have you tested for regression to ensure that the output is the same between your various methods? – Reinderien Feb 04 '23 at 17:48
  • It looks like your existing code is fairly nicely vectorized, so I'd assume it's relatively efficient. There might be small optimisations from doing things in place and avoiding allocating temporary arrays (use the [efficient Cython interface to `gammaln`](https://docs.scipy.org/doc/scipy/reference/special.cython_special.html) and operate an element at a time), but I wouldn't expect huge speed-ups. – DavidW Feb 04 '23 at 17:49
  • @Reinderien The output is the same, although I suppose the Cython function might be less accurate since the approximation of gammaln depends on n. – Elizabeth_Banks Feb 04 '23 at 17:54
  • If I am right, this is essentially evaluation of transcendental functions, which are performed at full processor speed. – Feb 05 '23 at 14:34
  • "Cython implemented in Cython": what? – Feb 05 '23 at 14:34
  • You could try with numba's jit and njit. – Guimoute Feb 05 '23 at 15:48

2 Answers


Observe the original implementation:

    def _logpmf(self, x, n, p):
        k = floor(x)
        combiln = (gamln(n+1) - (gamln(k+1) + gamln(n-k+1)))
        return combiln + special.xlogy(k, p) + special.xlog1py(n-k, -p)

You claim that the output is the same, but that's likely because you haven't tested certain edge cases. Your performance tests are not apples-to-apples, since the scipy implementation has more safety measures.
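
One way to probe this (a sketch, not an exhaustive test; the chosen values are illustrative assumptions, and `my_logpmf_nb` is the question's function, assumed in scope) is to compare the two on boundary values of the parameter space:

    import numpy as np
    from scipy.stats import nbinom

    # Edge values probing the boundaries of the parameter space; p == 0 lies
    # outside nbinom's parameter space, which is where scipy's checked wrapper
    # and the bare formula can disagree.
    x = np.array([0.0, 0.0, 5.0, 300.0, 1.0])
    n = np.array([0.5, 50.0, 7.5, 2.0, 3.0])
    p = np.array([1.0, 1e-12, 1.0 - 1e-12, 0.5, 0.0])

    with np.errstate(divide="ignore", invalid="ignore"):
        ours = my_logpmf_nb(x, n, p)   # the question's shortcut formula
    ref = nbinom.logpmf(x, n, p)       # scipy's checked implementation
    print(ours)
    print(ref)
    print(np.allclose(ours, ref, equal_nan=True))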

To improve performance you would need to drop down to a language that's closer to the metal, and possibly use GPGPU etc.

Reinderien

> I tried to optimize further with the use of Cython, but I failed completely (the function implemented in Cython is much slower).

Scipy's gammaln function maps to the native lgam function from the Cephes math library. lgam calls lgam_sgn, which appears to be already highly optimized. My understanding of the code is that it uses a different numerical method that converges much more efficiently (not all numerical methods are equal in speed and accuracy): a logarithmic version of Stirling's formula based on a degree-4 polynomial approximation (mixed with a rational approximation). The problem with your method is that it needs a very large number of iterations to reach an accuracy similar to the Scipy function. A numerical method requiring hundreds of thousands of iterations per evaluation (your gammaln defaults to n=100000) is generally considered inefficient and should not be used in high-performance code unless there is nothing better.
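
To make the convergence issue concrete, here is a small illustration in plain NumPy of the same truncated series the question's gammaln uses; the test point z = 15.5 is an arbitrary choice, and the truncation error shrinks only roughly like 1/n:

    import numpy as np
    from scipy.special import gammaln

    def series_lgamma(z, n):
        # Same truncated series as the question's Cython gammaln:
        # ln Gamma(z) = -gamma*z - ln(z) + sum_{k=1}^{n-1} (z/k - ln(1 + z/k))
        euler_mascheroni = 0.5772156649015329
        k = np.arange(1, n)
        t = z / k
        return -euler_mascheroni * z - np.log(z) + np.sum(t - np.log1p(t))

    z = 15.5  # arbitrary test point
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, abs(series_lgamma(z, n) - gammaln(z)))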

> Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way?

A simple way to improve the execution time is to use multiple threads to compute the result in parallel. You can easily do that in Cython with prange. Note, however, that the computation needs to be long enough for threads to be worthwhile (creating threads takes time, certainly more than computing a few items of res). For more information, please read the Cython documentation about parallelism.
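
Cython's prange is the direct route; purely as an illustration of the same element-wise parallel loop, here is a sketch using Numba's prange together with the C library's lgamma (this assumes Numba is installed, and no timings are claimed):

    import math
    import numpy as np
    from numba import njit, prange

    @njit(parallel=True, fastmath=True)  # fastmath mirrors -ffast-math; optional
    def matrix_nb_parallel(x, nn, p):
        m, n = x.shape
        res = np.empty((m, n))
        for i in prange(m):
            for j in range(n):
                # math.lgamma compiles to the C library's optimized log-gamma,
                # avoiding the slowly converging series used in the question.
                res[i, j] = (math.lgamma(nn[i, j] + x[i, j])
                             - math.lgamma(x[i, j] + 1.0)
                             - math.lgamma(nn[i, j])
                             + nn[i, j] * math.log(p[i, j])
                             + x[i, j] * math.log1p(-p[i, j]))
        return res

    # Usage with the arrays from the question (the first call includes JIT compilation):
    # matrix_nb_parallel(matrix_bis, r, p)

As noted above, the 20x8 matrix from the question is too small for threads to pay off; this only becomes worthwhile for much larger inputs.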

There is another way to speed the computation up: using SIMD instructions. That being said, this is far from easy here, especially if you are not a highly skilled developer. The idea is to compute multiple numbers at the same time using the same sequence of instructions in a single function call. This is hard to implement here because of possible branch divergence between the different SIMD lanes. The same problem happens on GPUs (i.e. warp-level divergence). One solution is to use a numerical method that tends not to branch, or that can be implemented in a branchless way. Your method has this property, but the speed-up provided by a SIMD implementation will certainly not be enough to outweigh its slow convergence. Besides, one also needs to implement log1p with SIMD instructions, with similar issues. AFAIK, the Intel math library implements this function using SIMD instructions; using it from Cython is certainly not trivial though.

> Or maybe numba would be useful?

Certainly not here. Indeed, assuming you use the compilation flags -O3 -march=native (and possibly -ffast-math, depending on your accuracy needs) to compile the C code, the results should be pretty close. The main difference is that Numba uses the LLVM compiler toolchain and adds some small overhead when accessing Numpy arrays or doing certain math operations, while the C code produced by Cython is generally compiled with GCC. The C code can also be compiled with Clang (which is based on LLVM too). I expect GCC and LLVM to produce binaries that are about equally fast in this case.
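
For reference, one way to pass those flags when building the Cython extension is through setup.py; a minimal sketch (module and file names are placeholders, and -fopenmp is only needed if prange is used, with GCC or Clang):

    from setuptools import Extension, setup
    from Cython.Build import cythonize

    extensions = [
        Extension(
            "nb_logpmf_cy",                      # placeholder module name
            ["nb_logpmf_cy.pyx"],                # placeholder source file
            extra_compile_args=["-O3", "-march=native", "-fopenmp"],
            extra_link_args=["-fopenmp"],        # required for prange/OpenMP
        )
    ]

    setup(ext_modules=cythonize(extensions, language_level=3))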

Jérôme Richard