26

Let's say I have a 2-dimensional matrix as a numpy array. If I want to delete rows with specific indices in this matrix, I use numpy.delete(). Here is an example of what I mean:

In [1]: my_matrix = numpy.array([
   ...:     [10, 20, 30, 40, 50],
   ...:     [15, 25, 35, 45, 55],
   ...:     [95, 96, 97, 98, 99]
   ...: ])
In [2]: numpy.delete(my_matrix, [0, 2], axis=0)
Out[2]: array([[15, 25, 35, 45, 55]])

I'm looking for a way to do the above with matrices from the scipy.sparse package. I know it's possible to do this by converting the entire matrix into a numpy array but I don't want to do that. Is there any other way of doing that?

Thanks a lot!

pemistahl
  • 9,304
  • 8
  • 45
  • 75

7 Answers7

26

For CSR, this is probably the most efficient way to do it in-place:

def delete_row_csr(mat, i):
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    n = mat.indptr[i+1] - mat.indptr[i]
    if n > 0:
        mat.data[mat.indptr[i]:-n] = mat.data[mat.indptr[i+1]:]
        mat.data = mat.data[:-n]
        mat.indices[mat.indptr[i]:-n] = mat.indices[mat.indptr[i+1]:]
        mat.indices = mat.indices[:-n]
    mat.indptr[i:-1] = mat.indptr[i+1:]
    mat.indptr[i:] -= n
    mat.indptr = mat.indptr[:-1]
    mat._shape = (mat._shape[0]-1, mat._shape[1])

In LIL format it's even simpler:

def delete_row_lil(mat, i):
    if not isinstance(mat, scipy.sparse.lil_matrix):
        raise ValueError("works only for LIL format -- use .tolil() first")
    mat.rows = np.delete(mat.rows, i)
    mat.data = np.delete(mat.data, i)
    mat._shape = (mat._shape[0] - 1, mat._shape[1])
pv.
  • 33,875
  • 8
  • 55
  • 49
13

Pv.s answer is a good and solid in-place solution that takes

a = scipy.sparse.csr_matrix((100,100), dtype=numpy.int8)
%timeit delete_row_csr(a.copy(), 0)
10000 loops, best of 3: 80.3 us per loop

for any array size. Since boolean indexing works for sparse matrices, at least in scipy >= 0.14.0, I would suggest to use it whenever multiple rows are to be removed:

def delete_rows_csr(mat, indices):
    """
    Remove the rows denoted by ``indices`` form the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = numpy.ones(mat.shape[0], dtype=bool)
    mask[indices] = False
    return mat[mask]

This solution takes significantly longer for a single row removal

%timeit delete_rows_csr(a.copy(), [50])
1000 loops, best of 3: 509 us per loop

But is more efficient for the removal of multiple rows, as the execution time barely increases with the number of rows

%timeit delete_rows_csr(a.copy(), numpy.random.randint(0, 100, 30))
1000 loops, best of 3: 523 us per loop
loli
  • 395
  • 3
  • 10
  • 1
    Based on my experience with a dozen matrices of approximate size 100,000 x 250,000 with less than 1% density and deleting about 20,000 rows from each, this is by far the fastest algorithm. – Brent Baccala Sep 06 '19 at 13:30
8

In addition to @loli's version of @pv's answer, I expanded their function to allow for row and/or column deletion by index on CSR matrices.

import numpy as np
from scipy.sparse import csr_matrix

def delete_from_csr(mat, row_indices=[], col_indices=[]):
    """
    Remove the rows (denoted by ``row_indices``) and columns (denoted by ``col_indices``) from the CSR sparse matrix ``mat``.
    WARNING: Indices of altered axes are reset in the returned matrix
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")

    rows = []
    cols = []
    if row_indices:
        rows = list(row_indices)
    if col_indices:
        cols = list(col_indices)

    if len(rows) > 0 and len(cols) > 0:
        row_mask = np.ones(mat.shape[0], dtype=bool)
        row_mask[rows] = False
        col_mask = np.ones(mat.shape[1], dtype=bool)
        col_mask[cols] = False
        return mat[row_mask][:,col_mask]
    elif len(rows) > 0:
        mask = np.ones(mat.shape[0], dtype=bool)
        mask[rows] = False
        return mat[mask]
    elif len(cols) > 0:
        mask = np.ones(mat.shape[1], dtype=bool)
        mask[cols] = False
        return mat[:,mask]
    else:
        return mat
Philip
  • 2,888
  • 2
  • 24
  • 36
  • Excuse the ignorance, but what to you mean by "WARNING: Indices of altered axes are reset in the returned matrix"? – CamilB Jan 25 '19 at 14:30
  • 2
    @CamilB The original indices will not match the indices of the returned/new matrix, because this function creates a new one and fills it with the values that should be kept. Example: if you have 4 rows and want the 3rd row out, the row indices would not be 0, 1 and 3, but 0, 1 and 2 (the 4th row index is reset to follow the previous number). – Philip Jan 25 '19 at 14:35
  • I thought it meant something else. Thank you. – CamilB Jan 26 '19 at 09:44
6

You can delete row 0 < i < X.shape[0] - 1 from a CSR matrix X with

scipy.sparse.vstack([X[:i, :], X[i:, :]])

You can delete the first or the last row with X[1:, :] or X[:-1, :], respectively. Deleting multiple rows in one gone will probably require rolling your own function.

For other formats than CSR, this might not necessarily work as not all formats support row slicing.

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • Oh dear, this is laborious and seems to be rather complicated for deleting multiple rows. If this is the only way of doing it, it's probably better for my purposes to convert the matrix into a numpy array, even though it's inefficient. – pemistahl Oct 25 '12 at 21:42
  • 3
    Shouldn't it be `scipy.sparse.vstack([X[:i, :], X[i+1:, :]])` – AlexConfused Jan 08 '19 at 17:20
5

To remove the i'th row from A simply use left matrix multiplication:

B = J*A

where J is a sparse identity matrix with i'th row removed.
Left multiplication by the transpose of J will insert a zero-vector back to the i'th row of B, which makes this solution a bit more general.

A0 = J.T * B

To construct J itself, I used pv.'s solution on a sparse diagonal matrix as follows (maybe there's a simpler solution for this special case?)

def identity_minus_rows(N, rows):
    if np.isscalar(rows):
        rows = [rows]
    J = sps.diags(np.ones(N), 0).tocsr()  # make a diag matrix
    for r in sorted(rows):
        J = delete_row_csr(J, r)
    return J

You may also remove columns by right-multiplying by J.T of the appropriate size.
Finally, multiplication is efficient in this case because J is so sparse.

Tomer Levinboim
  • 992
  • 12
  • 18
  • 1
    Probably he wanted to delete instead of logical erase by multiplying (i.e. making a row/column zero). – Alejandro Sazo Oct 05 '16 at 03:37
  • 1
    Row (or column) indexing, `A[index,:]`, uses this kind of multiplication. It creates an `extractor` matrix like this, and multiplies. Sparse matrix multiplication is a (relative) strong point. [Sparse matrix slicing using list of int](https://stackoverflow.com/questions/39500649/sparse-matrix-slicing-using-list-of-int) – hpaulj Apr 30 '18 at 01:06
  • I use this approach in my finite element code to remove the DoFs (degrees of freedom) in my global sparse stiffness matrix **K**. The neat advantage of this method for this usage is IMO the fact that at the beginning of a time increment, one computes the J matrix (see also [this](https://math.stackexchange.com/a/2085311/392320)) and can then use it for the multiple iterations by writing **Ksplit=J.dot(K).dot(J.transpose())** – MarcoMag Dec 16 '20 at 15:30
2

Note that sparse matrices support fancy indexing to some degree. So what you can do is this:

mask = np.ones(len(mat), dtype=bool)
mask[rows_to_delete] = False
# unfortunatly I think boolean indexing does not work:
w = np.flatnonzero(mask)
result = s[w,:]

The delete method doesn't really do anything else either.

seberg
  • 8,785
  • 2
  • 31
  • 30
2

Using @loli implementation, here I leave a function to remove columns:

def delete_cols_csr(mat, indices):
    """
    Remove the cols denoted by ``indices`` form the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = np.ones(mat.shape[1], dtype=bool)
    mask[indices] = False
    return mat[:,mask]
Federico Caccia
  • 1,817
  • 1
  • 13
  • 33