
I have two sparse matrices (created with sklearn's HashingVectorizer from two sets of features; each matrix corresponds to one feature set). I want to concatenate them to use later for clustering. But I am running into a problem with dimensions, because the two matrices do not have the same number of rows.

Here is an example:

Xa = [-0.57735027 -0.57735027  0.57735027 -0.57735027 -0.57735027  0.57735027
  0.5         0.5        -0.5         0.5         0.5        -0.5         0.5
  0.5        -0.5         0.5        -0.5         0.5         0.5        -0.5
  0.5         0.5       ]

Xb = [-0.57735027 -0.57735027  0.57735027 -0.57735027  0.57735027  0.57735027
 -0.5         0.5         0.5         0.5        -0.5        -0.5         0.5
 -0.5        -0.5        -0.5         0.5         0.5       ]

Both Xa and Xb are of type <class 'scipy.sparse.csr.csr_matrix'>. The shapes are Xa.shape = (6, 1048576) and Xb.shape = (5, 1048576). The error I get (I now understand why it happens) is:

    X = hstack((Xa, Xb))
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 581, in bmat
    'row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

Is there a way to stack the sparse matrices despite their mismatched row dimensions? Maybe with some padding?

I have looked into these posts:

  • can you post the shape of your matrices Xa and Xb? – João Almeida Nov 22 '16 at 15:08
  • updated post with shapes. – user1717931 Nov 22 '16 at 15:21
  • I think I found a work-around: I concatenated using numpy and converted the result to csr_matrix. Studying more to see if this is OK. Xc = np.concatenate([Xa.data, Xb.data]) and then doing: sm = sparse.csr_matrix(Xc) (see the sketch after these comments for what this produces). – user1717931 Nov 22 '16 at 15:24
    Performance wise that is not a great idea, you should try to always keep the matrices in sparse format so you don't run out of memory. Did you try my answer? – João Almeida Nov 22 '16 at 15:27
  • Not yet tried. I am trying to understand what is happening. Please correct me if I am wrong: You are taking the matrix that has lower number of rows, vertical-stacking with a custom-matrix that has the difference of the row-numbers (here, 4-3 = 1) and the column-value being the same (Xb.shape[1]). Once you vstack it, the resulting matrix will have the same dimension as the other one. My question is this: This custom-matrix you are vstacking - what are its contents? are they zeroes? – user1717931 Nov 22 '16 at 15:38
  • Yes, I'm stacking with an empty sparse matrix, which by definition is all zeros. – João Almeida Nov 22 '16 at 15:50
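
For reference, here is a quick sketch of what that np.concatenate work-around actually produces (scipy.sparse.random is only used here to build small stand-in matrices): concatenating the .data arrays keeps just the stored nonzero values and drops the row/column structure, which is another reason to keep everything in sparse format as suggested above.

import numpy as np
from scipy import sparse

# Small random sparse matrices standing in for the hashed feature matrices
Xa = sparse.random(6, 10, density=0.3, format='csr')
Xb = sparse.random(5, 10, density=0.3, format='csr')

# The work-around from the comments: concatenate only the stored values...
Xc = np.concatenate([Xa.data, Xb.data])
sm = sparse.csr_matrix(Xc)

# ...which yields a single row containing those values; the original
# row/column positions are lost.
print(sm.shape)  # (1, Xa.nnz + Xb.nnz)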

1 Answer


You can pad it with an empty sparse matrix.

You want to stack them horizontally, so you need to pad the smaller matrix until it has the same number of rows as the larger one. To do that, vertically stack it with an empty sparse matrix of shape (difference in number of rows, number of columns of the original matrix).

Like this:

from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from scipy.sparse import vstack

# Create two empty sparse matrices for the demo
Xa = csr_matrix((4, 4))
Xb = csr_matrix((3, 5))

# Difference in the number of rows between Xa and Xb
diff_n_rows = Xa.shape[0] - Xb.shape[0]

# Pad Xb with an all-zero sparse block so it has as many rows as Xa
Xb_new = vstack((Xb, csr_matrix((diff_n_rows, Xb.shape[1]))))

X = hstack((Xa, Xb_new))
X

Which results in:

<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in COOrdinate format>
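
If you need the same padding for the matrices from the question, it can be wrapped in a small helper. The function name pad_and_hstack below is my own (it is not part of scipy), and the shapes are taken from the question; treat this as a sketch rather than the canonical solution.

from scipy.sparse import csr_matrix, hstack, vstack

def pad_and_hstack(A, B):
    # Hypothetical helper: pad whichever matrix has fewer rows with an
    # all-zero sparse block, then stack the two horizontally.
    diff = A.shape[0] - B.shape[0]
    if diff > 0:
        B = vstack((B, csr_matrix((diff, B.shape[1]))))
    elif diff < 0:
        A = vstack((A, csr_matrix((-diff, A.shape[1]))))
    return hstack((A, B))

# With the shapes from the question:
Xa = csr_matrix((6, 1048576))
Xb = csr_matrix((5, 1048576))
X = pad_and_hstack(Xa, Xb)
print(X.shape)  # (6, 2097152)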