I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np

# rand() requires integer dimensions, so 1e6 must be cast to int
array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:

np.linalg.norm(array, ord=2, axis=(0, 1))
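For reference, here is roughly how I am timing it (a minimal sketch; the single-repetition timeit call is just my test harness, nothing essential):

import timeit

import numpy as np

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))

# Time one evaluation of the batched spectral norm.
elapsed = timeit.timeit(lambda: np.linalg.norm(array, ord=2, axis=(0, 1)), number=1)
print(f"np.linalg.norm took {elapsed:.2f} s")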
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np

cimport cython
cimport numpy as np

np.import_array()

DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    # Note: the example array above is complex128, so it has to be cast to
    # complex64 before calling this, or the assert fails.
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    # All the work still happens inside this single NumPy call, so the
    # static typing above does not speed anything up.
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
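For completeness, this is how I build and call it (a minimal sketch using pyximport; the file name norms.pyx is just a placeholder for wherever the function above lives):

# run.py -- minimal sketch; assumes the Cython code above is saved as norms.pyx
import numpy as np
import pyximport

pyximport.install(setup_args={"include_dirs": np.get_include()})
import norms  # compiles norms.pyx on first import

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
result = norms.function(array.astype(np.complex64))  # cast to satisfy the assert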
I also tried numba and some other SciPy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices, since the spectral norm is just the largest singular value. Everything is still too slow.
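Concretely, the scipy.linalg.svdvals attempt looked something like this (a sketch; I suspect the pure-Python loop over a million 8x8 matrices is itself a big part of the cost):

import numpy as np
from scipy.linalg import svdvals

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))

# Spectral norm = largest singular value; svdvals returns them in
# descending order, so the first entry is the norm.
norms = np.empty(array.shape[2])
for i in range(array.shape[2]):
    norms[i] = svdvals(array[:, :, i])[0]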
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible with Cython or numba? Or is my code highly inefficient, and am I doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck in my code is, i.e. whether it is CPU-bound or memory-bound, or something else entirely.
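To illustrate the chunking idea, this is the kind of approach I have in mind (a minimal sketch using multiprocessing; the chunk count and worker count are arbitrary, and I don't know whether the pickling overhead of shipping chunks to workers would eat the gains):

import numpy as np
from multiprocessing import Pool

def chunk_norms(chunk):
    # Batched spectral norms of one chunk of shape (8, 8, n).
    return np.linalg.norm(chunk, ord=2, axis=(0, 1))

if __name__ == "__main__":
    array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
    chunks = np.array_split(array, 8, axis=2)  # 8 chunks is arbitrary
    with Pool(processes=4) as pool:            # 4 workers is arbitrary
        results = pool.map(chunk_norms, chunks)
    norms = np.concatenate(results)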