3

I have a scipy sparse matrix in CSR format. It's 72665x72665 so it's impractical to convert this matrix to a dense matrix to perform operations on (the dense representation of this matrix is like 40 gigs). The matrix is symmetric, and has about 82 million non-zero entries (~1.5%).

What I would like to be able to do is, for each row, I want to get the indices of the largest N values. If this were a numpy array, I would use np.argpartition to do it like so:

    for row in matrix:
        top_n_idx = np.argpartition(row,-n)[-n:]

Is there something similar to this I can do for a sparse matrix?

enumaris
  • 1,868
  • 1
  • 16
  • 34
  • Similar questions from the past: [Get top-n items of every row in a scipy sparse matrix](https://stackoverflow.com/questions/36135927), and [Scipy.sparse.csr_matrix: How to get top ten values and indices?](https://stackoverflow.com/questions/31790819) – hpaulj Mar 10 '18 at 21:38

3 Answers3

5

Improve from @Paul Panzer's solution. Now it can handle the case when any row has less than n values.

def top_n_idx_sparse(matrix, n):
    """Return index of top n values in each row of a sparse matrix."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        top_n_idx.append(
            matrix.indices[
                le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
            ]
        )
    return top_n_idx

What it does?

The matrix.indptr gives the indices of the beginning of each row stored in the data array. So (lr, ri) is the range of data indices of non-zero values in each row. matrix.data[le:ri] gives the non-zero values of the row. Passing that to np.argpartition(..., -n_row_pick) gives you the local indices that will sort the row for the largest n_row_pick elements from the back. [-n_row_pick:] select those local indices. Then le + shift those local indices back to the indices in the data array. Finally, pass that to matrix.indices to get the largest n values indices in the matrix space.

Louis Yang
  • 3,511
  • 1
  • 25
  • 24
  • Can you maybe give a short explanation of what it does i.e why it works? – CutePoison Nov 16 '22 at 12:38
  • So this relates to how CSR format work. You can read http://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html as point out by Paul Panzer below. In the question statement, we already know we can find the locations of largest k numbers using `np.argpartition`. Here we simply specify the range of indices in the CSR format. – Louis Yang Nov 18 '22 at 02:35
2

Directly using the CSR format and assuming there are enough positive nonzeros in each row you can write:

for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
    top_n_idx = matrix.indices[le + np.argpartition(matrix.data[le:ri], -n)[-n:]]
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
  • Could you explain what this code snippet is doing? And there are many rows that are all 0, so if there is a need for every row to have "enough" nonzeros (what's enough?), I don't think my matrix satisfies that condition. – enumaris Mar 10 '18 at 22:23
  • @enumaris Well, enough is exactly `n`. If there are fewer than `n` positives then you have to fill up with zeros which is, of course, perfectly doable but tedious because they are not explicitly stored in a `csr` matrix. And, theoretically, there could even be negatives in the top `n` which would lead to even more tedious code. Re what the code is doing, I'll have to ask you to read up on the [`csr` format](http://www.scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html) (just the first paragraph) and it should become clear. If not I'll be happy to explain in more detail. – Paul Panzer Mar 10 '18 at 23:35
0

Quick and dirty:

Find top n values:

np.sort(matrix[i,:].data)[-n:]

Find indices for top n values:

matrix[i,:].nonzero()[1][np.argsort(matrix[i,:].data)[-n:]]
Radio Controlled
  • 825
  • 8
  • 23