
I started working on a SciPy interface to the Fortran libraries (BLAS/LAPACK), as one can see here: Calling BLAS / LAPACK directly using the SciPy interface and Cython. I came up with a solution, but had to resort to numpy.zeros, which in effect killed any speed gain from calling the Fortran code directly. The issue is that the Fortran routine needs a zero-valued output matrix (it operates on the matrix in place in memory) in order to match the NumPy version (np.outer).
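To make that concrete, here is a plain NumPy sketch (not the BLAS call itself) of why the buffer has to be zeroed: dger computes A := alpha*x*y**T + A, i.e. it adds into whatever A already holds, whereas np.outer simply overwrites its output.

import numpy as np

x = np.arange(3.0)
y = np.arange(3.0)

stale = np.ones((3, 3))                 # pretend the output buffer holds old results
accumulated = stale + np.outer(x, y)    # what dger effectively produces: A := x*y**T + A
fresh = np.outer(x, y)                  # what np.outer(x, y, out=...) produces

print(np.array_equal(accumulated, fresh))   # False -- hence the need to zero A first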

So I was a bit perplexed: creating a 1000x1000 zeros matrix in Python only takes about 8 µs (per %timeit, i.e. 0.008 ms), so why did adding it inside the Cython code kill the runtime, noting that I'm also handling it through a memoryview? (Basically it goes from about 3 ms to 8 ms or so on a 1000 x 1000 outer product.) Then, after searching around on SO, I found a suggestion elsewhere that memset is the fastest mechanism for resetting an array. So I rewrote the whole code from the referenced post in the newer memoryview format and got to something like this (Cython cyblas.pyx file):

import cython
import numpy as np
cimport numpy as np
from libc.string cimport memset  # faster than np.zeros
# dger is not cimported in the original snippet; it presumably comes from
# scipy.linalg.cython_blas (rank-1 update: A := alpha*x*y**T + A)
from scipy.linalg.cython_blas cimport dger

REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t  INT_t

cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef outer_prod(double[::1] _x, double[::1] _y, double[:, ::1] _output):

    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    memset(&_output[0,0], 0, M*N)

    with nogil:
        dger(&M, &N, &ONEF, &_y[0], &ONE, &_x[0], &ONE, &_output[0,0], &M)

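For anyone reproducing this, a minimal build script for the module above might look like the following. This is an assumed, standard setuptools + Cython setup; the file name setup.py and the build invocation are not from the original post.

# setup.py -- minimal build sketch; build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name="cyblas",
    ext_modules=cythonize("cyblas.pyx"),
    include_dirs=[np.get_include()],   # header path needed for "cimport numpy"
)
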
TEST SCRIPT

import numpy as np
from cyblas import outer_prod

a = np.random.randint(0, 100, 1000)
b = np.random.randint(0, 100, 1000)
a = a.astype(np.float64)
b = b.astype(np.float64)
cy_outer = np.zeros((a.shape[0], b.shape[0]))
np_outer = np.zeros((a.shape[0], b.shape[0]))

%timeit outer_prod(a,b, cy_outer)
%timeit np.outer(a,b, np_outer)
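A quick sanity check (not in the original script) can confirm both paths agree once the timings are done:

print(np.allclose(cy_outer, np_outer))   # should print True once the memset fix from the answer is applied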

So this fixed my issue of resetting the output matrix values, BUT ONLY UP TO ROW 125; BEYOND THAT THE ISSUE STILL EXISTS (resolved, see below). I thought setting the memset length parameter to M*N would clear all 1000*1000 entries in memory, but apparently not.

Does anyone have an idea on how to reset the whole output matrix to 0 using memset? Much appreciated.

Matt
    Nope... `memset` _will_ work for any size. Are you 100% sure that M and N are what you intend them to be? Did you print them out and check? – cs95 Jul 01 '17 at 22:47
  • @Coldspeed I just added the print out and it shows M and N are both 1000 and M*N is 1000000, just as to be expected. (1000, 1000, 1000000) as that equals the #in the output array – Matt Jul 01 '17 at 22:53
  • Okay, one more question. You're passing an array of 0s to the function anyway, so how do you know it only worked till row 125? – cs95 Jul 01 '17 at 22:55
  • @Coldspeed because after I run the posted script in Spyder I can pull up both arrays (`cy_outer` and `np_outer`), and they match exactly up to row 124 across all 1000 columns. Thereafter you can tell that elements weren't reset to 0. The reason I know this is because the original code I posted kept adding the prior calc to the new calc. This `memset` routine shows that 0s are in the output array to a certain point. Could it have something to do with my C contiguous array vs. an F contiguous array? Or does `memset` only clear in a certain direction? Sorry, guess I should have taken more C++ – Matt Jul 01 '17 at 22:59
  • Hmm, can't say as I haven't used `memset` through ctypes before. You could try one thing: do `cy_outer = np.ascontiguousarray(np.zeros((a.shape[0],b.shape[0])))` instead of what you have now and see if you still face this issue. Not sure how much of a difference it makes since `np.zeros` internally calls `malloc` + `memset`. – cs95 Jul 01 '17 at 23:02
  • @Coldspeed That's what I'm trying to avoid... the np.zeros works every time, it's just slow. It's Cython BTW, not ctypes, although they are quite similar. In other words... if it works in C++ it should work here. – Matt Jul 01 '17 at 23:03
  • @Coldspeed I think you're onto something: The memset() function fills the first n bytes of the memory area pointed to by s with the constant byte c. So M*N in my example is actually 1000*1000 float64_t values. So then to actually set all to 0 I probably need to do something like sizeof(np.float64) in there somewhere... – Matt Jul 01 '17 at 23:17
  • I'm voting to close this question because the question has been defaced and it is probably not worth rolling back and answering because it contained way more code than it should have in the first place. – David Schwartz Jul 01 '17 at 23:31
  • @Coldspeed You led me the right direction, post an answer and I'll accept it - it had to do with malloc for allocating not just the array dimensions but also the size (in bytes) of the variables. – Matt Jul 01 '17 at 23:47
  • @Matt What'd be better would be for you to post an answer to your own question since it was actually you who solved it in the end. I'll up vote it. :) – cs95 Jul 02 '17 at 00:37
  • I think it's probably quicker to use `np.zeros` than memset - it's fairly fast to allocate zeroed memory but slower to fill it to 0 after allocating it (see [this C answer](https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/2688522#2688522)) – DavidW Jul 02 '17 at 20:29
  • @Coldspeed I posted the answer – Matt Jul 02 '17 at 20:37
  • @DavidW Thanks for the link - unfortunately at least how I'm calling `calloc` in Cython, `memset` is still much faster. Although it may just be the fact that I'm taking output as a memoryview input in the 1st version and creating from scratch the `calloc` version in Cython. `from libc.stdlib cimport calloc, free` followed by `cdef double* _output = calloc(M*N, variable_bytes)` and then just passing `_output` instead of `&_output` in the function call. Last line `free(_output)` If it could be defined on the function inputs cpdef (like a memoryview) then I bet it would be faster. – Matt Jul 03 '17 at 21:28
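
For reference, here is a rough, self-contained sketch of the calloc variant described in the comment above. It is an assumed reconstruction, not code from the post: the function name, the cimport of dger from scipy.linalg.cython_blas, and the final copy back into a NumPy array are all illustrative choices.

# sketch of the calloc-based variant from the comments (names are illustrative)
import numpy as np
from libc.stdlib cimport calloc, free
from libc.string cimport memcpy
from scipy.linalg.cython_blas cimport dger

cdef int ONE = 1
cdef double ONEF = 1.0

cpdef outer_prod_calloc(double[::1] _x, double[::1] _y):
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    # calloc takes (count, element_size) and returns zero-initialised memory,
    # so no separate memset is required
    cdef double* _output = <double*> calloc(M * N, sizeof(double))
    if _output == NULL:
        raise MemoryError()
    result = np.empty((N, M), dtype=np.float64)
    cdef double[:, ::1] res_view = result
    try:
        with nogil:
            # _output is already a pointer, so it is passed directly (not &_output)
            dger(&M, &N, &ONEF, &_y[0], &ONE, &_x[0], &ONE, _output, &M)
        # copy the BLAS result into the NumPy array before freeing the C buffer
        memcpy(&res_view[0, 0], _output, M * N * sizeof(double))
    finally:
        free(_output)
    return result

As the last comment notes, the extra allocation and copy-out likely explain why this ended up slower than the memoryview-output version above.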

1 Answer


[UPDATE] The fix: memset needs the length in bytes, not just the number of array elements, i.e. M*N*variable_bytes rather than M*N. The prior results bear this out: clearing only M*N = 1,000,000 bytes, at 8 bytes per float64, covers 125,000 elements, which is exactly 125 rows of 1000, and that is where the zeroing stopped. This answers the question.

The metrics still aren't great:

100 loops, best of 3: 5.41 ms per loop (cython)
100 loops, best of 3: 3.95 ms per loop (numpy)

but it's solved nonetheless. The changes to the above code are to add (after REAL is defined):

cdef variable_bytes = np.dtype(REAL).itemsize  # bytes per element for memset, 8 in this case

and then to call memset with the byte count:

memset(&_output[0,0], 0, M*N*variable_bytes)  # now gives the same output as np.zeros

The only place I can see to speed this up further is to use the prange OpenMP statements on large matrices, but that is yet to be seen.
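As a sanity check on the row-125 figure, the arithmetic (plain Python, just to make the reasoning explicit) works out as follows:

M = N = 1000
itemsize = 8                               # bytes per float64
bytes_cleared = M * N                      # what the original memset call passed
rows_cleared = bytes_cleared // (N * itemsize)
print(rows_cleared)                        # 125 -> only the first 125 rows were zeroed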

Matt