How can I sample some of the rows of a scipy sparse matrix and form a new scipy sparse matrix from these sampled rows?

For example, if I have a scipy sparse matrix A with 10 rows and I want to build a new scipy sparse matrix B from rows 1, 3, and 4 of A, how do I do that?

Sujay_K

1 Answer

Left-multiply A by an appropriate indicator matrix. The indicator matrix can be built using `scipy.sparse.block_diag` or directly in CSR format, as shown below.

>>> import numpy as np
>>> from scipy import sparse
>>> 
# create example
>>> m, n = 10, 8
>>> subset = [1,3,4]
>>> A = sparse.csr_matrix(np.random.randint(-10, 5, (m, n)).clip(0, None))
>>> A.toarray()
array([[3, 2, 4, 0, 0, 0, 2, 0],
       [0, 0, 2, 0, 0, 0, 0, 0],
       [4, 0, 0, 0, 0, 2, 0, 0],
       [0, 0, 0, 0, 0, 0, 4, 0],
       [3, 0, 0, 0, 1, 4, 0, 0],
       [0, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 4, 0, 4, 4, 0],
       [0, 2, 0, 0, 0, 3, 0, 0],
       [4, 0, 3, 3, 0, 0, 0, 2],
       [4, 0, 0, 0, 0, 2, 0, 1]], dtype=int64)
>>>
# build indicator matrix
# either using block_diag ...
>>> split_points = np.arange(len(subset)+1).repeat(np.diff(np.concatenate([[0], subset, [m-1]])))
>>> indicator = sparse.block_diag(np.split(np.ones(len(subset), int), split_points)).T
>>> indicator.toarray()
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]], dtype=int64)
>>>
# ... or manually; this also works for an unsorted, non-unique subset,
# and is therefore preferable to block_diag
>>> indicator = sparse.csr_matrix((np.ones(len(subset), int), subset, np.arange(len(subset)+1)), (len(subset), m))
>>> indicator.toarray()
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])
>>> 
# apply
>>> result = indicator@A
>>> result.toarray()
array([[0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 4, 0],
       [3, 0, 0, 0, 1, 4, 0, 0]], dtype=int64)
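
The CSR construction above can also be wrapped in a small helper. This is only a minimal sketch and not part of the original session: the name `select_rows` is hypothetical, and it assumes a 2-D scipy sparse matrix and valid, non-negative row indices.

```python
import numpy as np
from scipy import sparse


def select_rows(A, rows):
    """Return a CSR matrix built from the given rows of A (hypothetical helper)."""
    rows = np.asarray(rows, dtype=np.intp)
    k, m = len(rows), A.shape[0]
    # CSR triple: each output row i stores a single 1 in column rows[i].
    data = np.ones(k, dtype=A.dtype)
    indptr = np.arange(k + 1)
    indicator = sparse.csr_matrix((data, rows, indptr), shape=(k, m))
    # Left-multiplying by the indicator picks out the requested rows.
    return (indicator @ A).tocsr()


# Small fixed example so the result is easy to check by hand.
dense = np.arange(40).reshape(10, 4)
A = sparse.csr_matrix(dense)
B = select_rows(A, [1, 3, 4])
```

Because every row of the indicator holds exactly one entry, the same helper should also accept unsorted or repeated indices such as `[4, 1, 4]`.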
Paul Panzer
  • Is there a way to index rows and then append them to the new matrix? What I actually want to do is create a new matrix which has let's say row 1 of the original matrix _k_ times, row 3 of the original matrix _l_ times and row 4 of the original matrix _m_ times. – Sujay_K May 26 '18 at 12:32
  • Looks like using vstack is one way to do it. But it's very inefficient because it creates a new matrix for stacking up each row vertically. – Sujay_K May 26 '18 at 13:08
  • @Sujay_K do you mean scaled or repeated? In the first case, use `indicator = sparse.csr_matrix((your_weights, subset, np.arange(len(subset)+1)), (len(subset), m))`. In the second case, use `indicator = sparse.csr_matrix((np.ones(np.sum(your_repeats), int), np.repeat(subset, your_repeats), np.arange(np.sum(your_repeats)+1)), (np.sum(your_repeats), m))`. – Paul Panzer May 26 '18 at 13:33
  • @hpaulj Good to know. Thanks! – Paul Panzer May 26 '18 at 14:33
  • The `csr` indexing code uses an extractor matrix like yours. You've just reinvented the wheel. :) [Sparse matrix slicing using list of int](https://stackoverflow.com/questions/39500649/sparse-matrix-slicing-using-list-of-int) – hpaulj May 26 '18 at 15:37
  • @hpaulj reinvented a lazy shortcut more like ;-D – Paul Panzer May 26 '18 at 16:06
  • @PaulPanzer I meant repeated. – Sujay_K May 26 '18 at 16:17
  • @hpaulj Do you mean that I can simply do B[i] = A[j] to move rows of A into matrix B? – Sujay_K May 26 '18 at 16:18
  • @Sujay_K, if it works, I suspect a `csr` matrix assignment will give you an `efficiency` warning. The `lil` format is better for assignment. Making a new matrix from rows or columns of a `csr` is ok. – hpaulj May 26 '18 at 16:29
  • @hpaulj Yeah it does give me that warning. So I can safely ignore that warning? – Sujay_K May 26 '18 at 21:39
  • @Sujay_K, the warning is there mainly to discourage its use in a loop, repeatedly. For a one-time action, the indexed `csr` setting might still be faster than alternatives (e.g. conversion to `lil`). – hpaulj May 26 '18 at 23:19
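
The two options raised in these comments, repeating rows through the indicator matrix and indexing the `csr` matrix directly with a list of row numbers, can be compared in a short sketch. The repeat counts and the names `repeats`, `via_indicator` and `via_indexing` are illustrative assumptions, not values from the thread.

```python
import numpy as np
from scipy import sparse

m, n = 10, 8
rng = np.random.default_rng(0)
A = sparse.csr_matrix(rng.integers(-10, 5, (m, n)).clip(0, None))

subset = [1, 3, 4]
repeats = [2, 1, 3]  # illustrative: row 1 twice, row 3 once, row 4 three times

# Expand the subset so each row index appears as often as it should be repeated.
rows = np.repeat(subset, repeats)
total = len(rows)

# Indicator matrix with a single 1 per output row, following the comment above.
indicator = sparse.csr_matrix(
    (np.ones(total, int), rows, np.arange(total + 1)), shape=(total, m)
)
via_indicator = indicator @ A

# Indexing a csr matrix with an integer array also selects (and repeats) rows.
via_indexing = A[rows]
```

Both results should hold the same rows in the same order, so for building a new matrix (rather than assigning into an existing one) plain indexing is probably the simpler choice.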