I want to compute the partition (or argpartition, but I guess the problem is quite similar) of a sparse two-dimensional matrix in an efficient way, in order to get its top-k items with low complexity.
Here is what I would do in the case of a dense matrix to get its argpartition line by line:
import numpy as np
n_best = 1
matrix = np.array([[0, 1, 2], [4, 0, 3], [0, 0, 0]])
np.argpartition(matrix, -n_best)[:, -n_best:]
which would give
array([[2],
[0],
[2]], dtype=int64)
Now that I have my sparse matrix, I do not want to convert it into a dense array because of its memory size, which would be too high to handle.
The only way I found out to easily compute its argpartition was with a for
loop line by line:
from scipy.sparse.csr import csr_matrix
sparse_matrix = csr_matrix(matrix)
sparse_argpartition = []
for row in range(sparse_matrix.shape[0]):
row_partition = sparse_matrix.getrow(row).indices[
sparse_matrix.getrow(row).data.argpartition(
-min(sparse_matrix.getrow(row).getnnz(), n_best)
)[-n_best:]
]
sparse_argpartition.append(
np.pad(row_partition, (0, n_best - len(row_partition)))
)
sparse_argpartition = np.array(sparse_argpartition)
I wonder if there is some parallelized way to efficiently perform this operation?