
I'm writing a function that I'd like to translate to C as much as possible using Cython. To do this, I'll need to use linear algebra operations. Here is my function. EDIT: The lesson I've learned is to take care of the linear algebra outside of loops where possible, which I've largely been able to do; otherwise, wrap LAPACK/BLAS or write my own functions.

import numpy as np
from scipy.stats import multivariate_normal as mv
import itertools

def llf(data, rho, mu, sigma, A, V, n):
    '''evaluate likelihood by Gauss-Hermite quadrature

    Parameters
    ----------
    data : array
        N x J matrix, columns are measurements
    rho : array
        length L vector of weights for mixture of normals
    mu : array
        L x K matrix of means of mixture of normals
    sigma : array
        K x L*K matrix of variance matrices for mixture of normals
    A : array
        J x (K + 1) matrix of loadings
    V : array
        J x J variance matrix of measurement errors
    n : int
        number of sample points for quadrature
    '''

    N = data.shape[0]
    L, K = mu.shape

    # getting weights and sample points for approximating integral
    v, w = np.polynomial.hermite.hermgauss(n)

    totllf = 0
    for i in range(N):
        M_i = data[i, :]
        totllf_i = 0
        for l in range(L):
            rho_l = rho[l]
            sigma_l = sigma[:, K*l:K*(l+1)]
            mu_l = mu[l, :]
            chol_l = np.linalg.cholesky(sigma_l)
            for ix in itertools.product(*(list(range(n)) for k in range(K))):
            wt = np.prod(w[list(ix)])
            pt = np.sqrt(2)*chol_l.dot(v[list(ix)]) + mu_l
            totllf_i += wt*rho_l*mv.pdf(M_i, A[:, 0] + A[:, 1:].dot(pt), V)

        totllf += np.log(totllf_i)

    return totllf
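
For reference, the inner loop is the standard Gauss-Hermite change of variables for a Gaussian expectation. Writing the Cholesky factorization as $\Sigma_l = C_l C_l^\top$,

$$\int_{\mathbb{R}^K} f(x)\,\mathcal{N}(x;\,\mu_l,\,\Sigma_l)\,dx \;\approx\; \pi^{-K/2}\sum_{i_1,\dots,i_K}\Big(\prod_{k=1}^{K} w_{i_k}\Big)\, f\!\big(\sqrt{2}\,C_l\,v_{(i_1,\dots,i_K)} + \mu_l\big),$$

where $v_{(i_1,\dots,i_K)}$ stacks the one-dimensional Hermite nodes. Up to the constant $\pi^{-K/2}$, this is what totllf_i accumulates for each mixture component, with f being the measurement density mv.pdf; this is where the sqrt(2)*chol_l.dot(v[list(ix)]) + mu_l points come from.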

In order to accomplish this I'd need functions for matrix multiplication, transpose, determinant, matrix inverse and Cholesky decomposition. I've seen some posts on using BLAS functions, but it's really unclear to me how to use these.
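
For reference, the posts I've seen go through scipy's cimportable BLAS bindings in scipy.linalg.cython_blas. Below is a minimal sketch of a matrix-vector product in that style (the name dotmv_blas and its exact typing are my own illustration; the key subtlety is that BLAS assumes Fortran/column-major storage, so a C-contiguous matrix is handed over as its transpose):

from scipy.linalg.cython_blas cimport dgemv

cdef void dotmv_blas(double[:, ::1] a, double[::1] x, double[::1] y) nogil:
    # y = a.dot(x) via BLAS. BLAS assumes column-major order, so the
    # C-contiguous m x n matrix is presented as its n x m transpose and
    # dgemv is asked to transpose it back (trans = 'T').
    cdef int m = a.shape[0], n = a.shape[1], inc = 1
    cdef double alpha = 1.0, beta = 0.0
    cdef char trans = b'T'
    dgemv(&trans, &n, &m, &alpha, &a[0, 0], &n, &x[0], &inc, &beta, &y[0], &inc)

Everything is passed by pointer, and the dimensions are flipped to account for the row-major/column-major mismatch; the same pattern applies to the LAPACK bindings in scipy.linalg.cython_lapack.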

EDIT 04/29/18

As suggested, I've taken a memoryview approach and initialized everything prior to the loop. My new function is written as

from libc.math cimport sqrt, exp, pi

def llf_c(double[:, ::1] data, double[::1] rho, double[:, ::1] mu,
          double[:, ::1] sigma, double[:, ::1] A, double[:, ::1] V, int n):
    '''evaluate likelihood by Gauss-Hermite quadrature

    Parameters
    ----------
    data : array
        N x J matrix, columns are measurements
    rho : array
        length L vector of weights for mixture of normals
    mu : array
        L x K matrix of means of mixture of normals
    sigma : array
        K x L*K matrix of variance matrices for mixture of normals
    A : array
        J x (K + 1) matrix of loadings
    V : array
        J x J variance matrix of measurement errors
    n : int
        number of sample points for quadrature
    '''

    cdef Py_ssize_t N = data.shape[0], J = data.shape[1], L = mu.shape[0], K = mu.shape[1]

    # initializing indexing variables
    cdef Py_ssize_t i, l, j, k

    # getting weights and sample points for approximating integral
    v_a, w_a = np.polynomial.hermite.hermgauss(n)
    cdef double[::1] v = v_a
    cdef double[::1] w = w_a
    cdef double[::1] v_ix = np.zeros(K, dtype=np.float64)

    # initializing memory views for cholesky decomposition of sigma matrices
    sigma_chol_a = np.zeros((K, L*K), dtype=np.float64)
    for l in range(L):
        sigma_chol_a[:, K*l:K*(l+1)] = np.linalg.cholesky(sigma[:, K*l:K*(l+1)])

    cdef double[:, ::1] sigma_chol = sigma_chol_a

    # initializing V inverse and determinant
    cdef double[:, ::1] V_inv = np.linalg.inv(V)
    cdef double V_det = np.linalg.det(V)

    # initializing memoryviews for work matrices
    cdef double[::1] work_K = np.zeros(K, dtype=np.float64)
    cdef double[::1] work_J = np.zeros(J, dtype=np.float64)

    # initializing memoryview for quadrature points
    cdef double[::1] pt = np.zeros(K, dtype=np.float64)

    # initializing memoryview for means for multivariate normal
    cdef double[::1] loc = np.zeros(J, dtype=np.float64)

    # initializing values for loop
    cdef double[::1] totllf = np.zeros(N, dtype=np.float64)
    cdef double wt, pdf_init = 1./sqrt(((2*pi)**J)*V_det)
    cdef int[:, ::1] ix_vals = np.array(list(itertools.product(*(list(range(n)) for k in range(K)))), dtype=np.int32)
    cdef Py_ssize_t ix_len = ix_vals.shape[0]

    for ix_row in range(ix_len):

        ix = ix_vals[ix_row]

        # weights and points for quadrature
        wt = 1.
        for k in range(K):
            wt *= w[ix[k]]
            v_ix[k] = v[ix[k]]

        for l in range(L):

            # change of variables
            dotmv_c(sigma_chol[:, K*l:K*(l+1)], v_ix, work_K)
            for k in range(K):
                pt[k] = sqrt(2)*work_K[k]
            addvv_c(pt, mu[l, :], pt)

            for i in range(N):

                # generating demeaned vector for multivariate normal pdf
                dotmv_c(A[:, 1:], pt, work_J)
                addvv_c(A[:, 0], work_J, work_J)
                for j in range(J):
                    loc[j] = data[i, j] - work_J[j]

                # performing matrix products in exponential
                # print(wt, rho[l], np.asarray(work_J))
                dotvm_c(loc, V_inv, work_J)
                totllf[i] += wt*rho[l]*pdf_init*exp(-0.5*dotvv_c(work_J, loc))

    return np.log(np.asarray(totllf)).sum()

The dotvm_c, dotmv_c and addvv_c are functions that perform vector-matrix multiplication, matrix-vector multiplication, and elementwise addition of two vectors. I've written these in Cython as well, but am not including them all for brevity. I'm no longer wrapping any LAPACK functions, as I do all other linear algebra prior to the loop using numpy. I have a few more questions. Why do I still have yellow in the loop? (See the profile below.) I thought everything should be in C now. Also, if you have any other suggestions based on the new implementation, please let me know.

[Cython annotation profile: several lines inside the main loop are still highlighted yellow]

For instance, on line 221 I get this message upon compiling: "Index should be typed for more efficient access." But I thought I had typed the index k. Also, since addvv_c shows up in yellow, I'll show its definition below.

cpdef void addvv_c(double[:] a, double[:] b, double[:] out):
    '''add two vectors elementwise
    '''
    cdef Py_ssize_t i, n = a.shape[0]

    for i in range(n):
        out[i] = a[i] + b[i]
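
dotmv_c is along the same lines; here is a minimal sketch of the kind of function I mean (typed with strided memoryviews since it gets called on non-contiguous slices like A[:, 1:]):

cpdef void dotmv_c(double[:, :] a, double[:] x, double[:] out):
    '''matrix-vector product, out = a.dot(x)
    '''
    cdef Py_ssize_t i, j, m = a.shape[0], n = a.shape[1]

    for i in range(m):
        out[i] = 0.
        for j in range(n):
            out[i] += a[i, j] * x[j]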
jtorca
  • I've updated the title to better express the problem. – 9000 Apr 26 '18 at 17:10
  • My advice would be to first prototype your approach in native Python, using NumPy and SciPy. This way you can provide us with a [MCVE](https://stackoverflow.com/help/mcve), by which we can help you solve your problem. Also note that NumPy is ruthlessly efficient when it comes to matrix operations, so you may not need to resort to Cython or BLAS after all. – MPA Apr 26 '18 at 18:37
  • I've added a MCVE. – jtorca Apr 26 '18 at 19:20
  • Oh and to comment on using Numpy. Because my whole problem cannot be expressed in matrix operations, I need to use loops, so using Python objects like Numpy arrays should be costly. Even if I could recast this particular problem without using for loops, I think it would be extremely useful to see how to use scipy's wrappers for BLAS functions to perform linear algebra. – jtorca Apr 26 '18 at 20:53
  • The whole point of using cython is that you can loop over a numpy array almost without overhead and it is some orders of magnitude faster than using lists. – ead Apr 27 '18 at 03:53
  • And what is your question? How to use blas? Just use numpy - it uses blas under the hood... – ead Apr 27 '18 at 04:01
  • I can potentially have many loops in the function, especially as K or L grow. I'd like to use Cython, but to get much out of it I assume I should not have Python objects within the loop. But if I use numpy array operations within the loop, I'm using Python objects. So if I can replace these numpy calls to C calls within the loop, I assume I can speed things up a lot. – jtorca Apr 27 '18 at 05:34
  • It depends how big your arrays are. For very small arrays (where the Python overhead is comparable to the cost of doing the array operation) it's sometimes worth calling blas from Cython. For mid-sized arrays upwards you won't get much benefit – DavidW Apr 27 '18 at 13:28
  • I expect these arrays to be small, as they are variance matrices. So 5x5, 10x10. – jtorca Apr 27 '18 at 14:25

1 Answer


A few small points with respect to your optimized Cython/BLAS functions:

ipiv_a = np.zeros(n).astype(np.int32)
cdef int[::1] ipiv = ipiv_a

can be improved in two easy ways: it doesn't have to go through a temporary variable, and you can directly create an array of type np.int32 rather than creating an array of a different type and then casting:

cdef int[::1] ipiv = np.zeros(n, dtype=np.int32)

Similarly, in both functions you can initialize B in fewer steps by doing

cdef double[:, ::1] B = A.copy()

rather than creating a zero array and copying into it.


The second (more significant) change would be to use C arrays for temporary variables such as the Fortran workspaces. I'd still keep things like return values as numpy arrays since the reference counting and ability to send them back to Python is very useful.

from libc.stdlib cimport malloc, free

cdef double* work = <double*>malloc(n*n*sizeof(double))
try:
    # rest of function
finally:
    free(work)

You need to cimport malloc and free from libc.stdlib. The try: ... finally: ensures the memory is correctly deallocated. Don't go too over the top with this - for example, if it isn't obvious where to deallocate a C array then just use numpy.


The final option to look at is to not have a return value but to modify an input:

cdef void inv_c(double[:,::1] A, double[:,::1] B):
    # check that A and B are the right size, then just write into B
    # ...

The advantage of this is that if you need to call this in a loop with identical sized inputs then you only need to do one allocation for the whole loop. You could extend this to include the working arrays too, although that might be a little more complicated.
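
As a rough sketch of that style (illustrative only, with minimal error handling, not a drop-in for your version), an inv_c built on scipy's cimportable LAPACK wrappers might look like:

from libc.stdlib cimport malloc, free
from scipy.linalg.cython_lapack cimport dgetrf, dgetri

cpdef void inv_c(double[:, ::1] A, double[:, ::1] B) except *:
    # writes inv(A) into the pre-allocated output B
    if A.shape[0] != A.shape[1] or B.shape[0] != A.shape[0] or B.shape[1] != A.shape[1]:
        raise ValueError("A must be square and B the same shape as A")
    cdef int n = A.shape[0]
    cdef int info = 0, lwork = n * n
    cdef int *ipiv = <int*>malloc(n * sizeof(int))
    cdef double *work = <double*>malloc(lwork * sizeof(double))
    if ipiv == NULL or work == NULL:
        free(ipiv)
        free(work)
        raise MemoryError()
    try:
        B[:, :] = A  # factorize a copy so A is left untouched
        # row-major data read column-major is A.T, and inv(A.T) = inv(A).T,
        # so the transposes cancel when the result is read back row-major
        dgetrf(&n, &n, &B[0, 0], &n, ipiv, &info)
        if info != 0:
            raise ValueError("dgetrf failed (info=%d)" % info)
        dgetri(&n, &B[0, 0], &n, ipiv, work, &lwork, &info)
        if info != 0:
            raise ValueError("dgetri failed (info=%d)" % info)
    finally:
        free(ipiv)
        free(work)

In the loop scenario you'd hoist ipiv and work out of the function and pass them in too, as described above.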

DavidW