
I have 4 sparse matrices with the following dimensions:

X_train_content_sparse.shape
(62313, 100000)

X_train_title_sparse.shape
(62313, 100000)

X_train_author_sparse.shape
(62313, 31540)

X_train_time_features_sparse.shape
(62313, 7)

Then I stack them horizontally:

X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, X_train_time_features_sparse])

After that I try to convert this array of sparse matrices into a single sparse matrix. I apply csr_matrix(X_train_sparse) and get this error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

>X_train_sparse
array([ <62313x100000 sparse matrix of type '<class 'numpy.float64'>'
    with 68519885 stored elements in Compressed Sparse Row format>,
       <62313x100000 sparse matrix of type '<class 'numpy.float64'>'
    with 795892 stored elements in Compressed Sparse Row format>,
       <62313x31540 sparse matrix of type '<class 'numpy.uint8'>'
    with 62313 stored elements in Compressed Sparse Row format>,
       <62313x7 sparse matrix of type '<class 'numpy.int64'>'
    with 176241 stored elements in Compressed Sparse Row format>], dtype=object)
hpaulj
Daniel Chepenko

1 Answer

In [83]: M
Out[83]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [84]: np.hstack([M,M])
Out[84]: 
array([<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>,
       <10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>], dtype=object)
In [85]: sparse.csr_matrix(_)
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

That was the wrong hstack. np.hstack knows nothing about sparse matrices, so it just wraps each one in an object array and joins them into a 2-element array. It's not surprising that csr_matrix has problems digesting that.

In [86]: sparse.hstack([M,M])
Out[86]: 
<10x20 sparse matrix of type '<class 'numpy.float64'>'
    with 40 stored elements in COOrdinate format>

sparse.hstack converts all matrices to coo format, joins their row, col, and data arrays appropriately, and then makes a new sparse matrix.
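
To make that description concrete, here is a rough sketch of the same idea done by hand. manual_hstack is a hypothetical helper written only for illustration; it is not scipy's actual implementation, which handles more cases (format fast paths, dtype handling, shape checks).

```python
import numpy as np
from scipy import sparse

def manual_hstack(blocks):
    """Illustrative only: join sparse blocks side by side via their COO arrays."""
    rows, cols, data = [], [], []
    n_rows = blocks[0].shape[0]
    col_offset = 0
    for b in blocks:
        b = b.tocoo()                    # expose row, col, data arrays
        if b.shape[0] != n_rows:
            raise ValueError("all blocks must have the same number of rows")
        rows.append(b.row)               # row indices are unchanged
        cols.append(b.col + col_offset)  # shift columns past the previous blocks
        data.append(b.data)
        col_offset += b.shape[1]
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n_rows, col_offset),
    )

# Small stand-in for M: a 10x10 matrix with 20 stored elements
M = sparse.csr_matrix((np.arange(100).reshape(10, 10) % 5 == 0).astype(float))

manual = manual_hstack([M, M])
builtin = sparse.hstack([M, M])
same = np.array_equal(manual.toarray(), builtin.toarray())
print(manual.shape, manual.nnz, same)
```

The only real work is the column offset: each block keeps its row indices, and its column indices are shifted by the total width of the blocks to its left.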

sparse.hstack also takes a format parameter:

In [88]: sparse.hstack([M,M],format='csr')
Out[88]: 
<10x20 sparse matrix of type '<class 'numpy.float64'>'
    with 40 stored elements in Compressed Sparse Row format>
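
Applied to a case like the one in the question, this might look as follows. The four matrices here are tiny made-up stand-ins with the same dtypes as the real ones (float64, float64, uint8, int64); the shapes and values are assumptions chosen only so the example runs quickly.

```python
import numpy as np
from scipy import sparse

n = 5  # number of rows shared by all blocks (62313 in the question)

# Hypothetical stand-ins for the four feature matrices
content = sparse.csr_matrix(np.eye(n, 4, dtype=np.float64))
title = sparse.csr_matrix(np.eye(n, 4, dtype=np.float64))
author = sparse.csr_matrix(np.eye(n, 3, dtype=np.uint8))
time_feats = sparse.csr_matrix(np.ones((n, 2), dtype=np.int64))

# Calling through the sparse namespace avoids any clash with np.hstack
X = sparse.hstack([content, title, author, time_feats], format="csr")

print(X.shape, X.format, X.dtype)
```

With format='csr' the result is already a csr matrix, so no extra csr_matrix(...) call is needed afterwards.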
hpaulj
  • I used `scipy.sparse.hstack` for stacking, not numpy. Also, I have updated the question description. – Daniel Chepenko Mar 29 '18 at 07:01
  • What you just added is produced by `np.hstack`, just like my `Out[84]`. `sparse.hstack` produces a sparse matrix, and takes a `format` parameter (see my edit). – hpaulj Mar 29 '18 at 07:22
  • You're right. It was actually a strange conflict between the numpy and scipy functions. I had imported `from scipy.sparse import hstack`, but by default numpy's `hstack` was being used. When I changed to `scipy.sparse.hstack` it worked fine. – Daniel Chepenko Mar 29 '18 at 08:27
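
If a bare hstack name might have been rebound by a later import, one quick way to check is to look at the function's __module__. This is a minimal sketch; the exact scipy submodule name varies between versions, so only the prefix is compared, and describe is a hypothetical helper.

```python
import numpy as np
from scipy import sparse

def describe(fn):
    """Hypothetical helper: report which package a function object comes from."""
    module = fn.__module__ or ""
    if module.startswith("scipy.sparse"):
        return "scipy.sparse"
    if module.startswith("numpy"):
        return "numpy"
    return module

np_origin = describe(np.hstack)
sp_origin = describe(sparse.hstack)
print(np_origin, sp_origin)

# Calling through the namespace sidesteps the name clash entirely
M = sparse.csr_matrix(np.eye(3))
stacked = sparse.hstack([M, M], format="csr")
print(sparse.issparse(stacked), stacked.shape)
```

Running describe(hstack) on the bare name in a session shows which of the two it currently points to.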