
After checking the documentation and this question, I tried to split a scipy sparse matrix the same way I would split a numpy array:

>>> print(X.shape)
(2399, 39999)

>>> print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>

>>> print(X.toarray())
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]

Then:

new_array = np.split(X,3)

Out:

ValueError: array split does not result in an equal division

Then I tried:

new_array = np.hsplit(X,3)

Out:

ValueError: bad axis1 argument to swapaxes

So, how can I split the matrix into N chunks of unequal size?

tumbleweed
  • It's possible that `np.split` doesn't work with sparse matrices. But conceptually all `np.split` does is return a list `[arr[0:n1], arr[n1:n2], arr[n2:n3], ...]`, i.e. just a list of slices. The `n`'s are calculated from your parameter and the length of that dimension. There's no special 'efficiency' involved. – hpaulj Mar 27 '17 at 16:27
  • Could you provide the solution for sparse matrices? – tumbleweed Mar 27 '17 at 16:34

2 Answers


Make a sparse matrix:

In [62]: M=(sparse.rand(10,3,.3,'csr')*10).astype(int)
In [63]: M
Out[63]: 
<10x3 sparse matrix of type '<class 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>
In [64]: M.A
Out[64]: 
array([[0, 7, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 5],
       [0, 0, 2],
       [0, 0, 6],
       [0, 4, 4],
       [7, 1, 0],
       [0, 0, 2]])

The dense equivalent is easily split. array_split handles unequal chunks, but you can also spell out the split as illustrated in the other answer.

In [65]: np.array_split(M.A, 3)
Out[65]: 
[array([[0, 7, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]), array([[0, 0, 5],
        [0, 0, 2],
        [0, 0, 6]]), array([[0, 4, 4],
        [7, 1, 0],
        [0, 0, 2]])]

In general, numpy functions cannot work directly on sparse matrices, because sparse matrices aren't a subclass of np.ndarray. Unless the function delegates the action to the object's own method, it probably won't work. Often the function starts with np.asarray(M), which is not the same as M.toarray() (try it yourself).
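To make the np.asarray(M) versus M.toarray() difference concrete, here is a small sketch. The matrix is a random stand-in built with sparse.random (not the M from the session above), and the behaviour described in the comments is what I'd expect on the numpy/scipy versions I'm aware of, so treat it as an illustration rather than a guarantee:

```python
import numpy as np
from scipy import sparse

# Random stand-in matrix; the seed is arbitrary.
M = sparse.random(10, 3, density=0.3, format="csr", random_state=0)

wrapped = np.asarray(M)  # wraps the sparse object itself; nothing is converted
dense = M.toarray()      # actual conversion to a 2-d ndarray

print(wrapped.shape, wrapped.dtype)  # a 0-d object array holding M
print(dense.shape, dense.dtype)      # a regular (10, 3) numeric array
```

A function that begins with np.asarray(M) therefore ends up working on a 0-d object array, which is why the axis-related errors look so strange.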

But split is nothing more than slicing along the desired axis. I can produce the same 4,3,3 split with:

In [143]: alist = [M[0:4,:], M[4:7,:], M[7:10]]
In [144]: alist
Out[144]: 
[<4x3 sparse matrix of type '<class 'numpy.int32'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in Compressed Sparse Row format>,
 <3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>]
In [145]: [m.A for m in alist]
Out[145]: 
[array([[0, 7, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]], dtype=int32), array([[0, 0, 5],
        [0, 0, 2],
        [0, 0, 6]], dtype=int32), array([[0, 4, 4],
        [7, 1, 0],
        [0, 0, 2]], dtype=int32)]

The rest is administrative details.

I should add that sparse slices are never views. They are new sparse matrices with their own data attribute.


With the split indexes in a list, we can construct the split list with a simple iteration:

In [146]: idx = [0,4,7,10]
In [149]: alist = []
In [150]: for i in range(len(idx)-1):
     ...:     alist.append(M[idx[i]:idx[i+1]])   

I haven't worked out the details of how to construct idx, though an obvious starting point is the 10, i.e. M.shape[0].
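One possible way to build idx, sketched here under the assumption that you want the same convention as np.array_split (the leading chunks get the extra rows). The helper name split_rows is made up for this example, and the matrix is a random stand-in:

```python
from itertools import accumulate

from scipy import sparse


def split_rows(M, n_chunks):
    """Split sparse M into n_chunks row blocks; leading blocks get the extra rows."""
    base, extra = divmod(M.shape[0], n_chunks)
    sizes = [base + 1] * extra + [base] * (n_chunks - extra)
    idx = [0, *accumulate(sizes)]  # e.g. [0, 4, 7, 10] for 10 rows and 3 chunks
    return [M[start:stop] for start, stop in zip(idx[:-1], idx[1:])]


M = sparse.random(10, 3, density=0.3, format="csr", random_state=0)
chunks = split_rows(M, 3)
print([c.shape for c in chunks])  # [(4, 3), (3, 3), (3, 3)]
```

If you would rather have the last block absorb the remainder, only the sizes line needs to change.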

For even splits (that fit):

In [160]: [M[i:i+5,:] for i in range(0,M.shape[0],5)]
Out[160]: 
[<5x3 sparse matrix of type '<class 'numpy.int32'>'
    with 2 stored elements in Compressed Sparse Row format>,
 <5x3 sparse matrix of type '<class 'numpy.int32'>'
    with 7 stored elements in Compressed Sparse Row format>]
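The same range-based slicing also copes with a chunk size that does not divide the row count, because a slice that runs past the end is simply truncated. A minimal sketch as a generator (iter_row_batches is a hypothetical name, and the matrix is again a random stand-in), which may suit batch-by-batch processing:

```python
from scipy import sparse


def iter_row_batches(M, batch_size):
    """Yield consecutive row blocks of at most batch_size rows each."""
    for start in range(0, M.shape[0], batch_size):
        # The final slice is cut off at M.shape[0], so the last block may be shorter.
        yield M[start:start + batch_size]


M = sparse.random(10, 3, density=0.3, format="csr", random_state=0)
print([b.shape for b in iter_row_batches(M, 4)])  # [(4, 3), (4, 3), (2, 3)]
```

Because it is a generator, only one block is materialized at a time, on top of the original matrix.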
hpaulj
  • Thanks for the help and for clarifying this issue. Is there a more convenient way of managing the chunk size than M[0:4,:], M[4:7,:], M[7:10]? Although it is correct, I guess I do not have control over those slice parameters. What if I do not know the shape of the initial matrix?... – tumbleweed Mar 27 '17 at 17:27
  • If the split isn't even then you have to decide for yourself which block should be larger or smaller than the others - the first, the last, etc. – hpaulj Mar 27 '17 at 18:11
  • In this case it makes no difference; I just want to implement incremental learning over a huge dataset. – tumbleweed Mar 27 '17 at 18:12

First, convert the scipy.sparse.csr_matrix to a numpy ndarray, then pass a list of split indices to numpy.split(ary, indices_or_sections, axis=0).

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in ary[:2], ary[2:3], ary[3:].

https://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html

X1, X2, X3 = np.split(X.toarray(), [1000,2000])
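Note that X.toarray() allocates the full dense array, which may be large for a matrix of shape (2399, 39999). If that is a concern, the same index convention can be applied to the sparse matrix directly. A rough sketch, where split_at is a made-up helper name and the small random matrix stands in for X:

```python
import numpy as np
from scipy import sparse


def split_at(M, indices):
    """Split sparse M along rows at the given sorted indices, like np.split(ary, indices)."""
    bounds = [0, *indices, M.shape[0]]
    return [M[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


# Small stand-in for X so the dense comparison below stays cheap.
X = sparse.random(30, 5, density=0.2, format="csr", random_state=0)
X1, X2, X3 = split_at(X, [10, 20])

# Sanity check against the dense route on this small example.
dense_parts = np.split(X.toarray(), [10, 20])
print(all(np.array_equal(s.toarray(), d) for s, d in zip((X1, X2, X3), dense_parts)))
```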
gzc