I am implementing an MCMC procedure, in which the most time-consuming part is calculating the logarithm of the negative binomial probability mass function (with matices as argument). The likelihood is computed in every iteration of the procedure for new values of parameters.
I wrote my own function, which is faster than in-built scipy nbinom.logpmf
.
import numpy as np
import scipy.special as sc
from scipy.stats import nbinom
def my_logpmf_nb(x, n, p):
""" logaritm of the negative binomial probability mass function """
coeff = sc.gammaln(n+x) - sc.gammaln(x+1) - sc.gammaln(n)
return coeff + n*np.log(p) + sc.xlog1py(x, -p)
N = 20
M = 8
p = np.random.uniform(0,1,(N,M))
r = np.abs(np.random.normal(10,10, (N,M)))
matrix = np.random.negative_binomial(r,p)
%timeit -n 1000 my_logpmf_nb(matrix, r, p)
16.4 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 nbinom.logpmf(matrix, r, p)
62.7 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I tried to optimize further with the use of Cython, but I failed completely (function implemented in Cython is much slower).
from libc.math cimport log, log1p
import numpy as np
cimport cython
cdef:
float EULER_MAS = 0.577215664901532 # euler mascheroni constant
@cython.cdivision(True)
def gammaln(float z, int n=100000):
"""Compute log of gamma function for some real positive float z"""
"""From: https://stackoverflow.com/questions/54850985/fast-algorithm-for-log-gamma-function"""
cdef:
float out = -EULER_MAS*z - log(z)
int k
float t
for k in range(1, n):
t = z / k
out += t - log1p(t)
return out
@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def matrix_nb(double[:,:] x, double[:,:] nn, double[:,:] p):
m = x.shape[0]
n = x.shape[1]
res = np.zeros((m, n))
for i in range(m):
for j in range(n):
res[i,j] = gammaln(nn[i,j]+x[i,j]) - gammaln(x[i,j]+1) - gammaln(nn[i,j]) + nn[i,j]*log(p[i,j]) + x[i,j]*log1p(0-p[i,j])
return res
matrix_bis = matrix.astype("float")
%timeit -n 1000 matrix_nb(matrix_bis, r, p)
49.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way? Or maybe numba would be useful?