
I normally use

matrix[:, i:]

It doesn't seem to work as fast as I expected.

desertnaut
todpole3

2 Answers


To obtain a sparse matrix as output, the fastest way to do row slicing is to use the csr format, and for column slicing the csc format, as detailed here. In both cases you just have to do what you are currently doing:

matrix[l1:l2, c1:c2]
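For instance, if your matrix is in CSR format and you need many column slices, converting to CSC first may pay off. A small sketch with a made-up random matrix (actual timings depend on size and density):

```python
import numpy as np
from scipy import sparse

# Hypothetical example matrix: 1000x1000, ~1% non-zero, in CSR format
matrix = sparse.random(1000, 1000, density=0.01, format="csr", random_state=0)

# Row slicing is efficient directly on the CSR matrix...
rows = matrix[10:20, :]

# ...while column slicing benefits from converting to CSC first
cols = matrix.tocsc()[:, 10:20]
```

Both results are still sparse matrices, so no dense copy is made.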

If you want an ndarray as output, it might be faster to perform the slicing directly on the ndarray, which you can obtain from the sparse matrix using the .A attribute or the .toarray() method:

matrix.A[l1:l2, c1:c2] 

or:

matrix.toarray()[l1:l2, c1:c2]

As mentioned in the comment below, converting the sparse array to a dense array might lead to memory errors if the array is big enough.

Saullo G. P. Castro

I've found that the advertised fast row indexing of scipy.sparse.csr_matrix can be made a lot quicker by rolling your own row indexer. Here's the idea:

import numpy as np
from scipy import sparse

class SparseRowIndexer:
    def __init__(self, csr_matrix):
        data = []
        indices = []
        indptr = []

        # Iterating over the rows this way is significantly more efficient
        # than csr_matrix[row_index, :] and csr_matrix.getrow(row_index)
        for row_start, row_end in zip(csr_matrix.indptr[:-1], csr_matrix.indptr[1:]):
            data.append(csr_matrix.data[row_start:row_end])
            indices.append(csr_matrix.indices[row_start:row_end])
            indptr.append(row_end - row_start)  # nnz of the row

        # dtype=object lets each row array keep its own length
        # (plain np.array() raises an error for ragged input on modern NumPy)
        self.data = np.array(data, dtype=object)
        self.indices = np.array(indices, dtype=object)
        self.indptr = np.array(indptr)
        self.n_columns = csr_matrix.shape[1]

    def __getitem__(self, row_selector):
        data = np.concatenate(self.data[row_selector])
        indices = np.concatenate(self.indices[row_selector])
        indptr = np.append(0, np.cumsum(self.indptr[row_selector]))

        shape = [indptr.shape[0] - 1, self.n_columns]

        return sparse.csr_matrix((data, indices, indptr), shape=shape)

That is, it is possible to exploit the fast indexing of NumPy arrays by storing the non-zero values of each row in a separate array (with a different length for each row) and putting all of those row arrays in an object-dtype array, which allows each row to have a different size and can be indexed efficiently. The column indices are stored the same way. This differs slightly from the standard CSR data structure, which stores all non-zero values in a single array and requires look-ups to see where each row starts and ends. Those look-ups can slow down random access but are efficient for retrieving contiguous rows.
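The object-array trick can be seen in isolation with a toy example (my own made-up data, not part of the answer's benchmark):

```python
import numpy as np

# Ragged per-row value arrays, mimicking rows with different nnz counts
rows = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0, 6.0])]

# An object-dtype array stores one variable-length array per element
# while still supporting NumPy's vectorized fancy indexing
row_store = np.empty(len(rows), dtype=object)
row_store[:] = rows

# Fancy indexing picks whole rows in one operation, in any order
selected = row_store[[2, 0]]
stacked = np.concatenate(selected)  # -> array([4., 5., 6., 1., 2.])
```

This is exactly what `__getitem__` above does for the `data` and `indices` arrays before reassembling a CSR matrix.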

Profiling results

My matrix mat is a 1,900,000x1,250,000 csr_matrix with 400,000,000 non-zero elements. ilocs is an array of 200,000 random row indices.

>>> %timeit mat[ilocs]
2.66 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

compared to:

>>> row_indexer = SparseRowIndexer(mat)
>>> %timeit row_indexer[ilocs]
59.9 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The SparseRowIndexer seems to be faster when using fancy indexing compared to boolean masks.
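If you already have a boolean mask, one option is to convert it to integer indices before indexing (a small sketch; whether this helps depends on your selection size):

```python
import numpy as np

# Hypothetical boolean row mask over a 4-row matrix
mask = np.array([True, False, True, False])

# np.flatnonzero turns the mask into fancy-indexing-friendly row indices
ilocs = np.flatnonzero(mask)  # -> array([0, 2])
```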

Sorig