I assume you are already using @cython.boundscheck(False), so there is not much you can do to improve on it performance-wise.
For readability reasons I would use:
cpc_x[:]=0.0
cpc_y[:]=0.0
Cython would translate this to for-loops. An additional advantage: even if @cython.boundscheck(False) isn't used, the resulting C code will nonetheless be without bounds checks (__Pyx_RaiseBufferIndexError). Here is the resulting code for a[:]=0.0:
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent_0 = __pyx_v_a.shape[0];
Py_ssize_t __pyx_temp_stride_0 = __pyx_v_a.strides[0];
char *__pyx_temp_pointer_0;
Py_ssize_t __pyx_temp_idx_0;
__pyx_temp_pointer_0 = __pyx_v_a.data;
for (__pyx_temp_idx_0 = 0; __pyx_temp_idx_0 < __pyx_temp_extent_0; __pyx_temp_idx_0++) {
*((double *) __pyx_temp_pointer_0) = __pyx_temp_scalar;
__pyx_temp_pointer_0 += __pyx_temp_stride_0;
}
}
}
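For context, a minimal Cython sketch that would produce a strided loop like the one above (the function name reset is only illustrative, not from your code):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def reset(double[:] cpc_x, double[:] cpc_y):
    # slice assignment - Cython expands it to the strided C loop shown above
    cpc_x[:] = 0.0
    cpc_y[:] = 0.0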
What could improve the performance is to declare the memory views to be contiguous (i.e. double[::1] instead of double[:]). The resulting C code for a[:]=0.0 would then be:
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent = __pyx_v_a.shape[0];
Py_ssize_t __pyx_temp_idx;
double *__pyx_temp_pointer = (double *) __pyx_v_a.data;
for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
}
}
}
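Switching is only a matter of the type annotation, e.g. (again just an illustrative sketch, not your actual signature):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def reset_contiguous(double[::1] cpc_x, double[::1] cpc_y):
    # double[::1] guarantees unit stride, so the generated loop uses plain pointer increments
    cpc_x[:] = 0.0
    cpc_y[:] = 0.0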
As one can see, strides[0] is no longer used in the contiguous version - strides[0]=1 is evaluated during the compilation and the resulting C code can be better optimized (see for example here).
One could be tempted to get smart and use the low-level memset function:
from libc.string cimport memset
memset(&cpc_x[0], 0, 16*sizeof(double))
However, for bigger arrays there will be no difference compared to the usage of a contiguous memory view (i.e. double[::1], see here for example). There might be less overhead for smaller sizes, but I never cared enough to check.
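If you want to try it anyway, here is a sketch that avoids hard-coding the size (assuming a contiguous double[::1] view; the function name is made up):

from libc.string cimport memset

def zero_with_memset(double[::1] cpc_x):
    # memset works on raw bytes: element count times sizeof(double);
    # this is only valid for contiguous buffers
    if cpc_x.shape[0] > 0:
        memset(&cpc_x[0], 0, cpc_x.shape[0] * sizeof(double))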