1

I am trying to find an efficient way that lets me increase the top k values of a sparse matrix by some constant value. I am currently using the following code, which is quite slow for very large matrices:

a = csr_matrix((2,2)) #just some sample data
a[1,1] = 3.
a[0,1] = 2.

y = a.tocoo()
idx = y.data.argsort()[::-1][:1] #k is 1
for i, j in izip(y.row[idx], y.col[idx]):
    a[i,j] += 1

Actually the sorting seems to be fast, the problem lies with my final loop where I increase the values by indexing via the sorted indices. Hope someone has an idea of how to speed this up.

fsociety
  • 1,791
  • 4
  • 22
  • 32

1 Answers1

6

You could probably speed things up quite a lot by directly modifying a.data rather than iterating over row/column indices and modifying individual elements:

idx = a.data.argsort()[::-1][:1] #k is 1
a.data[idx] += 1

This also saves converting from CSR --> COO.

Update

As @WarrenWeckesser rightly points out, since you're only interested in the indices of the k largest elements and you don't care about their order, you can use argpartition rather than argsort. This can be quite a lot faster when a.data is large.

For example:

from scipy import sparse

# a random sparse array with 1 million non-zero elements
a = sparse.rand(10000, 10000, density=0.01, format='csr')

# find the indices of the 100 largest non-zero elements
k = 100

# using argsort:
%timeit a.data.argsort()[-k:]
# 10 loops, best of 3: 135 ms per loop

# using argpartition:
%timeit a.data.argpartition(-k)[-k:]
# 100 loops, best of 3: 13 ms per loop

# test correctness:
np.all(a.data[a.data.argsort()[-k:]] == 
       np.sort(a.data[a.data.argpartition(-k)[-k:]]))
# True
ali_m
  • 71,714
  • 23
  • 223
  • 298
  • Cheers, that does the job pretty well! – fsociety Jul 21 '14 at 15:10
  • 3
    Since all that is needed is the `k` largest elements, you could use `arpartition` instead of `argsort`. This could make a noticeable improvement in performance if `a.data` is large. – Warren Weckesser Jul 21 '14 at 18:26
  • @WarrenWeckesser great suggestion - I've updated my answer with some benchmarking of the two methods – ali_m Jul 21 '14 at 20:12