
I have a big csr_matrix and I want to sum over groups of rows to obtain a new csr_matrix with the same number of columns but fewer rows. (Context: the matrix is a document-term matrix obtained from sklearn's CountVectorizer, and I want to be able to quickly combine documents according to codes associated with these documents.)

For a minimal example, this is my matrix:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import vstack

row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))
print(A.toarray())

[[1 0 0 0 0]
 [0 0 3 0 0]
 [0 5 0 0 0]
 [4 0 0 0 0]
 [0 0 2 0 0]]

Now let's say I want a new matrix B in which rows (1, 4) and (2, 3, 5) are combined by summing them, which would look something like this:

[[5 0 0 0 0]
 [0 5 5 0 0]]

B should again be in sparse format (because the real data I'm working with is large). I tried to sum over slices of the matrix and then stack the results:

idx1 = [1, 4]
idx2 = [2, 3, 5]
A_sub1 = A[idx1, :].sum(axis=1)
A_sub2 = A[idx2, :].sum(axis=1)
B = vstack((A_sub1, A_sub2))

But this gives me the summed values only for the non-zero columns in each slice, so I can't combine it with the other slices because the summed slices have different numbers of columns.

I feel like there must be an easy way to do this. But I couldn't find any discussion of this online or in the documentation. What am I missing?

Thank you for your help

Fridolin Linder

2 Answers


Note that you can do this by carefully constructing another matrix. Here's how it would work for a dense matrix:

>>> S = np.array([[1, 0, 0, 1, 0,], [0, 1, 1, 0, 1]])
>>> np.dot(S, A.toarray())
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])
>>>

The sparse version is only a little more complicated. The information about which rows should be summed together is encoded in row:

col = range(5)
row = [0, 1, 1, 0, 1]
dat = [1, 1, 1, 1, 1]
S = csr_matrix((dat, (row, col)), shape=(2, 5))
result = S * A
# check that the result is another sparse matrix
print(type(result))
# check that the values are the ones we want
# (toarray is only used here to display this small example)
print(result.toarray())

Output:

<class 'scipy.sparse.csr.csr_matrix'>
[[5 0 0 0 0]
 [0 5 5 0 0]]

You can handle more rows in your output by including higher values in row and extending the shape of S accordingly.
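As a rough sketch of that generalization, assuming every row of A comes with a code: the array `codes` and the helper name `sum_rows_by_group` below are made up for illustration and are not part of the original example. The idea is to let np.unique turn the codes into output row indices and then build S from them.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

def sum_rows_by_group(A, codes):
    """Sum the rows of sparse matrix A that share the same code (hypothetical helper)."""
    codes = np.asarray(codes)
    # labels: sorted unique codes; group[i]: output row that input row i is added to
    labels, group = np.unique(codes, return_inverse=True)
    n_rows = A.shape[0]
    ones = np.ones(n_rows, dtype=A.dtype)
    # S has one row per code and one column per row of A
    S = csr_matrix((ones, (group, np.arange(n_rows))), shape=(len(labels), n_rows))
    return labels, S.dot(A)

# The matrix from the question
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Assumed codes: rows 1 and 4 belong to "a", rows 2, 3 and 5 to "b"
codes = ["a", "b", "b", "a", "b"]
labels, B = sum_rows_by_group(A, codes)

print(issparse(B))   # the result stays sparse
print(labels)
print(B.toarray())   # only for display on this small example
```

Row i of B corresponds to labels[i], so the codes can be of any sortable type (strings, integers) as long as there is exactly one per row of A.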

YXD
  • Hi, thanks for your answer. I can't transform my matrix A to an array because it is too big. But I guess I can do the matrix multiply directly on the sparse matrix? – Fridolin Linder Apr 14 '15 at 14:59
  • Ah the last `toarray` is just to show that we're getting the right answer when we multiply `S * A` for this small example - I didn't mean that you'd convert to a dense array in your code. I'll add a comment. – YXD Apr 14 '15 at 15:01

The indexing should be:

idx1 = [0, 3]       # rows 1 and 4
idx2 = [1, 2, 4]    # rows 2,3 and 5

Then you need to keep A_sub1 and A_sub2 in sparse format and use axis=0:

A_sub1 = csr_matrix(A[idx1, :].sum(axis=0))
A_sub2 = csr_matrix(A[idx2, :].sum(axis=0))
B = vstack((A_sub1, A_sub2))
B.toarray()
array([[5, 0, 0, 0, 0],
       [0, 5, 5, 0, 0]])

Note that A[idx, :].sum(axis=0) returns a dense matrix, which is then converted back to sparse, so @Mr_E's answer is probably better.

Alternatively, if a dense result is acceptable, use axis=0 with np.vstack (as opposed to scipy.sparse.vstack):

A_sub1 = A[idx1, :].sum(axis=0)
A_sub2 = A[idx2, :].sum(axis=0)
np.vstack((A_sub1, A_sub2))

Giving:

matrix([[5, 0, 0, 0, 0],
        [0, 5, 5, 0, 0]])
Lee
  • `A.sum(0)` uses `np.matrix(np.ones((1,5),int))*A`, which returns a dense matrix. `sparse.csr_matrix(np.ones((1,5),int))*A` returns sparse. – hpaulj Apr 14 '15 at 15:51
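A small check of the difference described in that comment, written against the example matrix from the question. This is only a sketch; the variable names `dense_sum` and `sparse_sum` are chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# The matrix from the question
row = np.array([0, 4, 1, 3, 2])
col = np.array([0, 2, 2, 0, 1])
dat = np.array([1, 2, 3, 4, 5])
A = csr_matrix((dat, (row, col)), shape=(5, 5))

# Summing with .sum(axis=0) gives a dense 1 x 5 result
dense_sum = A.sum(axis=0)

# Multiplying by a sparse row of ones gives the same column sums, kept sparse
ones = csr_matrix(np.ones((1, A.shape[0]), dtype=int))
sparse_sum = ones.dot(A)

print(issparse(dense_sum))    # False
print(issparse(sparse_sum))   # True
print(sparse_sum.toarray())   # [[5 5 5 0 0]]
```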