
I'm trying to create a certain style of band(ed) matrix (see Wikipedia). The following code works, but for large M (~300 or so) it becomes quite slow because of the for loop. Is there a way to vectorize it/make better use of NumPy and/or SciPy? I am having trouble figuring out the mathematical operation that this corresponds to, and hence I have not succeeded thus far.

The code I have is as follows:

import numpy as np

def banded_matrix(M):
    phis = np.linspace(0, 2*np.pi, M)
    ham = np.zeros((2*M, 2*M))
    for i, phi in enumerate(phis):
        # 2x2 block for this phi
        ham_phi = np.array([[1, 1],
                            [1, -1]])*(1 + np.cos(phi))
        # place the block at position i on the diagonal via a Kronecker product
        array_phi = np.zeros(M)
        array_phi[i] = 1
        mat_phi = np.diag(array_phi)
        ham += np.kron(mat_phi, ham_phi)
    return ham

With %timeit banded_matrix(M=300) it takes about 4 seconds on my computer.

Since the code is a bit opaque: what I want is to construct a large 2M-by-2M matrix that, in a sense, has M entries along its 'width 2' diagonal, where each entry is a 2x2 matrix ham_phi that depends on phi. The matrix will be diagonalized afterwards, so perhaps one could even exploit its structure/the fact that it is rather sparse to speed that up, but of that I am not sure.
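To make that last point concrete, here is the kind of thing I have in mind (a rough sketch only, not something I have benchmarked): since the matrix is block diagonal, its eigenvalues should just be the union of the eigenvalues of the M individual 2x2 blocks.

w = 1 + np.cos(np.linspace(0, 2 * np.pi, M))
# stack of M symmetric 2x2 blocks; eigvalsh accepts stacked matrices
blocks = w[:, None, None] * np.array([[1, 1], [1, -1]])
eigs = np.linalg.eigvalsh(blocks)  # shape (M, 2)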

If anyone has an idea where to go with this, I'd be happy to follow up on that!

user129412

3 Answers


Your matrix is block diagonal, so you can use scipy.linalg.block_diag:

import numpy as np
from scipy.linalg import block_diag

def banded_matrix_scipy(M):
    ham = np.array([[1, 1], [1, -1]])
    phis = np.linspace(0, 2 * np.pi, M)
    # stack of M 2x2 blocks, one per phi
    ham_phis = ham * (1 + np.cos(phis))[:, None, None]
    return block_diag(*ham_phis)

Let's check that it works and is faster:

b1 = banded_matrix(300)
b2 = banded_matrix_scipy(300)

np.all(b1 == b2)  # True
>>> %timeit banded_matrix(300)
1.51 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit banded_matrix_scipy(300)
1.24 ms ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
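Since the matrix will be diagonalized afterwards, a sparse version is a one-line change and may suit a sparse eigensolver better. A minimal sketch, assuming scipy.sparse (banded_matrix_sparse is my name for it, not benchmarked here):

from scipy.sparse import block_diag as sparse_block_diag

def banded_matrix_sparse(M):
    ham = np.array([[1, 1], [1, -1]])
    phis = np.linspace(0, 2 * np.pi, M)
    ham_phis = ham * (1 + np.cos(phis))[:, None, None]
    # same blocks, but assembled into a sparse CSR matrix
    return sparse_block_diag(list(ham_phis), format='csr')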
paime

The obligatory np.einsum benchmark

def banded_matrix_einsum(M):
    # 'ij,kl->ikjl': outer product of an MxM diagonal matrix with the
    # 2x2 block; the reshape interleaves the axes into a 2M x 2M matrix
    return np.einsum('ij, kl-> ikjl',
              np.eye(M)*(1 + np.cos(np.linspace(0, 2 * np.pi, M))),
              np.array([[1, 1], [1, -1]])).reshape(2*M, 2*M)

banded_matrix_einsum(4)

Output

array([[ 2. ,  2. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ],
       [ 2. , -2. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.5,  0.5,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.5, -0.5,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0.5,  0.5,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0.5, -0.5,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  2. ,  2. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  2. , -2. ]])
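For reference, the same construction can be written with a single np.kron, which also makes explicit the mathematical operation the question was looking for: summing kron(mat_phi, ham_phi) over all i equals kron(diag(w), H), with w the vector of weights 1 + cos(phi). A minimal equivalent (my formulation, same inputs assumed):

def banded_matrix_kron(M):
    w = 1 + np.cos(np.linspace(0, 2 * np.pi, M))
    # kron with a diagonal matrix places w[i] * H in the i-th 2x2 block
    return np.kron(np.diag(w), np.array([[1, 1], [1, -1]]))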

Benchmark results

[benchmark plot: banded_matrix vs banded_matrix_scipy vs banded_matrix_einsum]

import perfplot

perfplot.show(
    setup = lambda M: M,
    kernels = [banded_matrix_einsum, banded_matrix_scipy, banded_matrix],
    n_range = [50, 100, 150, 200, 250, 300],
    logx = False
)

scipy.linalg.block_diag vs np.einsum details

[benchmark plot: banded_matrix_scipy vs banded_matrix_einsum]

perfplot.show(
    setup = lambda M: M,
    kernels = [banded_matrix_einsum, banded_matrix_scipy],
    n_range = [50, 100, 150, 200, 250, 300, 350, 400],
    logx = False
)
Michael Szczesny

As another approach, you can use the numba JIT compiler to speed this up. I propose a numba equivalent of scipy.linalg.block_diag, based on paime's answer:

import numba as nb
import numpy as np

@nb.njit
def block_diag_numba(result, ham_phis):
    # write each 2x2 block onto the diagonal of the preallocated result
    for i in range(ham_phis.shape[0]):
        result[i * 2,     i * 2:i * 2 + 2] = ham_phis[i, 0]
        result[i * 2 + 1, i * 2:i * 2 + 2] = ham_phis[i, 1]
    return result

def numba_(M):
    ham = np.array([[1, 1], [1, -1]])
    phis = np.linspace(0, 2 * np.pi, M)
    ham_phis = ham * (1 + np.cos(phis))[:, None, None]
    return block_diag_numba(np.zeros((M * ham.shape[1], M * ham.shape[1])), ham_phis)

This method is at least 4-5x faster than the previous ones for M up to 400 (microsecond scale). It can be adjusted to other block shapes, and it could be improved further by moving all the code into the numba function or by parallelizing it. I didn't pursue that here, since paime's answer already seemed to satisfy the OP; the point is just to show that numba can produce a much faster equivalent of scipy.linalg.block_diag:

[benchmark plot: block_diag_numba vs scipy.linalg.block_diag]
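For completeness, here is a minimal sketch of the "bring everything into numba and parallelize" variant mentioned above (hypothetical and unbenchmarked; banded_matrix_numba is my name for it):

@nb.njit(parallel=True)
def banded_matrix_numba(M):
    result = np.zeros((2 * M, 2 * M))
    phis = np.linspace(0, 2 * np.pi, M)
    for i in nb.prange(M):  # blocks are independent, safe to parallelize
        w = 1 + np.cos(phis[i])
        # i-th 2x2 diagonal block: w * [[1, 1], [1, -1]]
        result[2 * i, 2 * i] = w
        result[2 * i, 2 * i + 1] = w
        result[2 * i + 1, 2 * i] = w
        result[2 * i + 1, 2 * i + 1] = -w
    return result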

Ali_Sh