0

My goal here is to build the sparse CSR matrix very fast. It is currently the main bottleneck in my process, and I've already optimized it by constructing the coo matrix relatively fast, and then using tocsr().

However, I would imagine that constructing the csr matrix directly must be faster?

I have a very specific format of a sparse matrix that is also large (i.e. on orders of 100000x50000). I've looked online at these other answers, but most are not addressing the question I have.

Sparse Matrix Structure:

The sparse matrix, H, is comprised of W lists of size N, or built from an initial array of size NxW, lets call it A. Along the diagonal are repeating lists of size N for N times. So for the first N rows of H, is A[:,0] repeated, but sliding along N steps for each row.

Comparison to COO.tocsr()

When I scale up N and W, and build up the COO matrix first, then running tocsr(), it is actually faster then just building the CSR matrix directly. I'm not sure why this would be the case? I am wondering if perhaps I can take advantage of the structure of my sparse matrix H in some way? Since there are many repeating elements in there.

Code Sample

Here is a code sample to visualize what is going on for a small sample size:

from scipy.sparse import linalg, dok_matrix, coo_matrix, csr_matrix
import numpy as np
import matplotlib.pyplot as plt

def test_csr(testdata):
    indices = [x for _ in range(W-1) for x in range(N**2)]
    ptrs = [N*(i) for i in range(N*(W-1))]
    ptrs.append(len(indices))

    data = []
    # loop along the first axis
    for i in range(W-1):
        vec = testdata[:,i].squeeze()

        # repeat vector N times
        for i in range(N):
            data.extend(vec)

    Hshape = ((N*(W-1), N**2))
    H = csr_matrix((data, indices, ptrs), shape=Hshape)
    return H

N = 4
W = 8

testdata = np.random.randn(N,W)
print(testdata.shape)
H = test_csr(testdata)
plt.imshow(H.toarray(), cmap='jet')
plt.show()

Figure that shows the repeating structure of the sparse matrix

ajl123
  • 1,172
  • 5
  • 17
  • 40
  • If the input arrays have the correct shape and dtype, it will use them directly without making copies. I haven't studied the `csr_matrix` code enough to help with the details. Conversion from `coo` requires a lexsort, which may be costly. If you can read the python code you probably are already ahead of most us. – hpaulj Jan 16 '19 at 22:44

1 Answers1

0

It looks like your output has only the first W-1 rows of test data. I'm not sure if this is intentional or not. My solutions assumes you want to use all of testdata.

When you construct the COO matrix are you also constructing the indices and data in a similar way?

One thing which might speed up constructing the csr_matrix is to use built in numpy functions to generate the data for the csr_matrix rather than python loops and arrays. I would expect this to improve the speed of generating the indices significantly. You can adjust the dtype to different type of int depending on the size of your matrix.

N = 4
W = 8
testdata = np.random.randn(N,W)

ptrs = N*np.arange(W*N+1,dtype='int')
indices =  np.tile(np.arange(N*N,dtype='int'),W)
data = np.tile(testdata,N).flatten()

Hshape = ((N*W, N**2))
H = csr_matrix((data, indices, ptrs), shape=Hshape)

Another possibility is to first construct the large array, and then define each of the N vertical column blocks at once. This means that you don't need to make a ton of copies of the original data before you put it into the sparse matrix. However, it may be slow to convert the matrix type.

N = 4
W = 8
testdata = np.random.randn(N,W)

Hshape = ((N*W, N**2))
H = sp.sparse.lol_matrix(Hshape)

for j in range(N): 
    H[N*np.arange(W)+j,N*j:N*(j+1)] = testdata.T

H = H.tocsc()
overfull hbox
  • 747
  • 3
  • 10