40

I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need an efficient, flexible way to concatenate them in order to construct a stochastic matrix from separate parts.

Right now I'm using the following way to concatenate four matrices, but it's horribly inefficient. Is there any better way to do this that doesn't involve converting to a dense matrix?

# rmat is a preallocated lil_matrix with the combined shape;
# m1, m2 and bridge are the blocks of the 2x2 block matrix.
rmat[0:m1.shape[0],0:m1.shape[1]] = m1                              # upper-left
rmat[m1.shape[0]:rmat.shape[0],m1.shape[1]:rmat.shape[1]] = m2      # lower-right
rmat[0:m1.shape[0],m1.shape[1]:rmat.shape[1]] = bridge              # upper-right
rmat[m1.shape[0]:rmat.shape[0],0:m1.shape[1]] = bridge.transpose()  # lower-left
Jonathan Leffler
jones

4 Answers

55

The sparse library now has hstack and vstack for concatenating matrices horizontally and vertically, respectively.
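For the 2x2 block layout in the question (using its m1, m2 and bridge names), a minimal sketch of how the pieces could be assembled, assuming all blocks are already scipy.sparse matrices with compatible shapes:

import scipy.sparse as sp

# Block layout from the question:
#   [ m1        bridge ]
#   [ bridge.T  m2     ]
top = sp.hstack([m1, bridge])        # blocks with the same number of rows
bottom = sp.hstack([bridge.T, m2])
rmat = sp.vstack([top, bottom])      # rows of blocks with the same number of columns

# Convert to whatever format the rest of the code needs, e.g. CSR:
rmat = rmat.tocsr()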

Erik
  • Make sure you use scipy.sparse.hstack instead of numpy.hstack – pettinato Jul 12 '17 at 20:50
  • It should be added to this answer that `hstack` concatenates sparse matrices with the same number of rows (horizontal concatenation) and `vstack` concatenates sparse matrices with the same number of columns (vertical concatenation). – Farid Alijani Jun 15 '23 at 06:55
17

Amos's answer is no longer necessary. Scipy now does something similar to this internally if the input matrices are in csr or csc format and the desired output format is either set to None or matches the format of the input matrices. It's efficient to vertically stack matrices in csr format, or to horizontally stack matrices in csc format, using scipy.sparse.vstack or scipy.sparse.hstack, respectively.
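A quick way to see this fast path in action (a sketch with small random matrices; as the comments below note, this behaviour requires scipy 0.14 or later):

import scipy.sparse as sp

a = sp.random(1000, 500, density=0.01, format='csr')
b = sp.random(2000, 500, density=0.01, format='csr')
print(sp.vstack([a, b]).format)    # 'csr' - stacked directly, no intermediate coo conversion

c = sp.random(500, 1000, density=0.01, format='csc')
d = sp.random(500, 2000, density=0.01, format='csc')
print(sp.hstack([c, d]).format)    # 'csc'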

Joel Croteau
  • Which version does "now" refer to? Do you have any reference for this? – lenz Oct 23 '17 at 16:51
  • The relevant code is [this snippet](https://github.com/scipy/scipy/blob/master/scipy/sparse/construct.py#L552) from `scipy.sparse.bmat`, which both `vstack` and `hstack` use. This hack was originally added [here](https://github.com/scipy/scipy/commit/10b2cbdc980c6e1695c732c90fba99f722437171) in 2013. It looks like it was originally included in scipy 1.0.0. – Joel Croteau Oct 23 '17 at 23:09
  • Actually, I was wrong about that. It was originally included in 0.14. – Joel Croteau Oct 25 '17 at 00:41
16

Using hstack, vstack, or concatenate is dramatically slower than concatenating the matrices' inner data arrays directly. The reason is that hstack/vstack convert the sparse matrices to coo format, which can be very slow when a matrix is very large and not already in coo format. Here is the code for concatenating csc matrices; a similar method can be used for csr matrices:

import numpy as np
from scipy.sparse import csc_matrix

def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    # Values and row indices: matrix1's columns come first, then matrix2's.
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Offset matrix2's column pointers by matrix1's non-zero count and drop
    # the leading zero so the combined pointer array stays consistent.
    new_ind_ptr = matrix2.indptr + len(matrix1.data)
    new_ind_ptr = new_ind_ptr[1:]
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))

    # Pass the shape explicitly so trailing all-zero rows are not dropped.
    return csc_matrix((new_data, new_indices, new_ind_ptr),
                      shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1]))
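A quick sanity check of the helper above against scipy.sparse.hstack, sketched with hypothetical small random csc matrices:

import scipy.sparse as sp

a = sp.random(100, 40, density=0.05, format='csc')
b = sp.random(100, 60, density=0.05, format='csc')

fast = concatenate_csc_matrices_by_columns(a, b)
reference = sp.hstack([a, b], format='csc')

assert fast.shape == reference.shape
assert (fast != reference).nnz == 0   # no element differs between the two results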
Amos
  • Was just looking at a fast way of appending new rows to a CSR matrix. This is exactly what I need. Thanks @amos. – singleton Feb 20 '17 at 08:37
  • If you use this method you need to specify the shape, i.e.: return csc_matrix((new_data, new_indices, new_ind_ptr), shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1])) – simeon Jun 19 '17 at 05:20
  • What would be the code for csr matrices? Is the native scipy implementation really faster now? Because I have to concatenate four submatrices (upper-left, upper-right, lower-left, lower-right) and I am not satisfied with the result. It takes less time to recompute the entire matrix, although I would only have to compute upper-right and lower-left. So this slowness essentially makes tabulation useless in my case. It annoys me because I think you would just have to change some pointers in C if both the matrix and the operation were optimally implemented. – Radio Controlled Feb 19 '19 at 07:59
  • Although I am not sure if the index pointers are stored in a list in C or in an array. If it were a list would you not just have to reset one pointer at the end of the list? The way it is now, the larger the matrix, the longer the stacking... – Radio Controlled Feb 19 '19 at 08:03
14

Okay, I found the answer. Using scipy.sparse.coo_matrix is much, much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then just concatenated the data, row and column arrays after applying the right row and column offsets.

import numpy as np
from scipy.sparse import coo_matrix

# m1S, m2S, bridgeS and bridgeTS are the four blocks converted to coo format;
# shift each block's row/column indices to its position in the block matrix.
data = np.concatenate((m1S.data, bridgeS.data, bridgeTS.data, m2S.data))
rows = np.concatenate((m1S.row, bridgeS.row, bridgeTS.row + m1S.shape[0], m2S.row + m1S.shape[0]))
cols = np.concatenate((m1S.col, bridgeS.col + m1S.shape[1], bridgeTS.col, m2S.col + m1S.shape[1]))

rmat = coo_matrix((data, (rows, cols)), shape=(m1S.shape[0] + m2S.shape[0], m1S.shape[1] + m2S.shape[1]))
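For comparison, the same block matrix can be built in a single call with scipy.sparse.bmat, the function that hstack and vstack delegate to (see the comments above); a minimal sketch using the question's block names:

import scipy.sparse as sp

# [ m1        bridge ]
# [ bridge.T  m2     ]
rmat = sp.bmat([[m1, bridge],
                [bridge.T, m2]], format='coo')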
jones