Efficiently applying a threshold function to SciPy sparse csr_matrix

Question

I have a SciPy csr_matrix (a vector in this case) with 1 column and x rows. It holds float values that I need to convert to the discrete class labels -1, 0 and 1, using a threshold function that maps each float to one of these three labels.

Is there no way other than iterating over the elements, as described in "Iterating through a scipy.sparse vector (or matrix)"? I would love an elegant way to just map(thresholdfunc()) over all elements.

Note that while it is of type csr_matrix, it isn't actually sparse; it is just the return value of another function in which a sparse matrix was involved.

`csc` format is much more compact when representing a column vector. — hpaulj, Jul 02 '17 at 17:34
If it isn't particularly sparse, then I'd suggest working on the dense array version, `M.toarray()` (or `M.A`). — hpaulj, Jul 02 '17 at 17:36
Thanks! Is it worth it to convert it to an array first if I'm only going to iterate over it once? I mean, doesn't the overhead of the conversion maybe eat the advantages of the array? — Florian, Jul 03 '17 at 09:34

score 4 · Accepted Answer · answered Jul 03 '17 at 15:36

If you have an array, you can discretize it based on a condition with the np.where function. For example:

>>> import numpy as np
>>> x = np.arange(10)
>>> np.where(x < 5, 0, 1)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

The syntax is np.where(BOOLEAN_ARRAY, VALUE_IF_TRUE, VALUE_IF_FALSE). You can nest two np.where calls to handle multiple conditions:

>>> np.where(x < 3, -1, np.where(x > 6, 0, 1))
array([-1, -1, -1,  1,  1,  1,  1,  0,  0,  0])
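
With more than two conditions, nested calls can get hard to read. As an alternative that the answer itself does not use, np.select takes a list of conditions and a matching list of values. The sketch below only reproduces the chained result above; conditions are checked in order and the first match wins.

```python
import numpy as np

x = np.arange(10)

# Nested np.where, as in the example above.
nested = np.where(x < 3, -1, np.where(x > 6, 0, 1))

# np.select: the first condition that is True picks the value;
# `default` covers elements that match no condition.
selected = np.select([x < 3, x > 6], [-1, 0], default=1)

print(np.array_equal(nested, selected))  # True
```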

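To apply this to your data in a CSR or CSC sparse matrix, you can use the .data attribute, which gives you access to the internal array containing all the explicitly stored (normally nonzero) entries of the sparse matrix. Implicit zeros are not part of .data, so they are left unchanged, which is why the first element stays 0 in the output. For example: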

>>> from scipy import sparse
>>> mat = sparse.csr_matrix(x.reshape(10, 1))
>>> mat.data = np.where(mat.data < 3, -1, np.where(mat.data > 6, 0, 1))
>>> mat.toarray()
array([[ 0],
       [-1],
       [-1],
       [ 1],
       [ 1],
       [ 1],
       [ 1],
       [ 0],
       [ 0],
       [ 0]])
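
For the float-to-label case from the question, the same pattern might look like the sketch below. The thresholds `lower` and `upper`, the sample values, and the helper name `threshold_labels` are assumptions made up for illustration; none of them come from the question. Entries that get mapped to 0 stay stored as explicit zeros until eliminate_zeros() is called.

```python
import numpy as np
from scipy import sparse


def threshold_labels(values, lower=-0.5, upper=0.5):
    """Map floats to -1, 0 or 1 (hypothetical thresholds)."""
    labels = np.where(values < lower, -1, np.where(values > upper, 1, 0))
    return labels.astype(np.int8)


# Made-up column vector of scores; the 0.0 entry is not stored.
scores = np.array([[-1.2], [0.0], [0.3], [0.9], [-0.7]])
mat = sparse.csr_matrix(scores)

labeled = mat.copy()
labeled.data = threshold_labels(labeled.data)  # transform stored values only
labeled.eliminate_zeros()                      # drop entries that became 0

print(labeled.toarray().ravel().tolist())  # [-1, 0, 0, 1, -1]
```

If the matrix is effectively dense, as the question suggests, calling threshold_labels(mat.toarray()) gives the same labels as a plain NumPy array.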

This seems nice in terms of syntactic sugar, thanks. I guess it wouldn't be much faster though, from a quick search concerning np.where() performance. — Florian, Jul 04 '17 at 18:22
np.where is similar to other numpy vectorized functions, in which the loops are executed in compiled code. In general it will be significantly faster than iterating over elements using a Python for loop for all but the smallest arrays. — jakevdp, Jul 04 '17 at 20:38
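
A rough way to check the comparison above on your own machine is sketched below. The array size, the random data, and the ±0.5 thresholds are arbitrary assumptions, and the timings will vary between systems, so no numbers are claimed here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100_000)  # made-up test data


def loop_labels(v):
    # Element-by-element thresholding in pure Python.
    return [(-1 if e < -0.5 else 1 if e > 0.5 else 0) for e in v]


def where_labels(v):
    # Same mapping, with the loop running in compiled NumPy code.
    return np.where(v < -0.5, -1, np.where(v > 0.5, 1, 0))


# Both approaches must agree before timing them.
assert loop_labels(values) == where_labels(values).tolist()

t_loop = timeit.timeit(lambda: loop_labels(values), number=3)
t_where = timeit.timeit(lambda: where_labels(values), number=3)
print(f"loop: {t_loop:.4f}s  np.where: {t_where:.4f}s")
```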
