I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:
import numpy as np

# rand() requires integer dimensions, so 1e6 must be cast to int
array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
It currently takes me almost 10 seconds using numpy.linalg.norm:

np.linalg.norm(array, ord=2, axis=(0, 1))
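For reference, here is roughly how I am timing it (a minimal sketch; the single-repetition timeit call is just my test harness, nothing essential):

import timeit

import numpy as np

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))

# Time one evaluation of the batched spectral norm.
elapsed = timeit.timeit(lambda: np.linalg.norm(array, ord=2, axis=(0, 1)), number=1)
print(f"np.linalg.norm took {elapsed:.2f} s")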
I tried using the Cython code below, but this gave me only a negligible performance improvement:
import numpy as np

cimport cython
cimport numpy as np

np.import_array()

DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    # Note: the example array above is complex128, so it has to be cast to
    # complex64 before calling this, or the assert fails.
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    # All the work still happens inside this single NumPy call, so the
    # static typing above does not speed anything up.
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray
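For completeness, this is how I build and call it (a minimal sketch using pyximport; the file name norms.pyx is just a placeholder for wherever the function above lives):

# run.py -- minimal sketch; assumes the Cython code above is saved as norms.pyx
import numpy as np
import pyximport

pyximport.install(setup_args={"include_dirs": np.get_include()})
import norms  # compiles norms.pyx on first import

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
result = norms.function(array.astype(np.complex64))  # cast to satisfy the assert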
I also tried numba and some other SciPy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices, since the spectral norm is just the largest singular value. Everything is still too slow.
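Concretely, the scipy.linalg.svdvals attempt looked something like this (a sketch; I suspect the pure-Python loop over a million 8x8 matrices is itself a big part of the cost):

import numpy as np
from scipy.linalg import svdvals

array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))

# Spectral norm = largest singular value; svdvals returns them in
# descending order, so the first entry is the norm.
norms = np.empty(array.shape[2])
for i in range(array.shape[2]):
    norms[i] = svdvals(array[:, :, i])[0]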
Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible with Cython or numba? Or is my code highly inefficient, and am I doing something fundamentally wrong?
I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at these previous StackOverflow questions:
Why does multiprocessing use only a single core after I import numpy?
multithreaded blas in python/numpy (didn't help)
and several others, but unfortunately I still don't have a solution.
I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck in my code is, i.e. whether it is CPU-bound or memory-bound, or something else entirely.
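To illustrate the chunking idea, this is the kind of approach I have in mind (a minimal sketch using multiprocessing; the chunk count and worker count are arbitrary, and I don't know whether the pickling overhead of shipping chunks to workers would eat the gains):

import numpy as np
from multiprocessing import Pool

def chunk_norms(chunk):
    # Batched spectral norms of one chunk of shape (8, 8, n).
    return np.linalg.norm(chunk, ord=2, axis=(0, 1))

if __name__ == "__main__":
    array = np.random.rand(8, 8, int(1e6)) + 1j * np.random.rand(8, 8, int(1e6))
    chunks = np.array_split(array, 8, axis=2)  # 8 chunks is arbitrary
    with Pool(processes=4) as pool:            # 4 workers is arbitrary
        results = pool.map(chunk_norms, chunks)
    norms = np.concatenate(results)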