
I am working with a scipy.sparse.csr.csr_matrix (4860x89462 sparse matrix of type '<class 'numpy.float64'>' with 9111761 stored elements) and using Jupyter Notebook 3.7.4.

My requirement is to extract the top 2 results in each row, based on the values of the elements in the sparse matrix.

Here is a small example of my sparse csr_matrix:

Current Sparse matrix

  (1, 1)    0.5
  (1, 5)    0.66
  (1, 6)    1.0
  (2, 2)    1.0
  (2, 3)    0.5
  (2, 7)    0.33

Desired Sparse matrix

  (1, 6)    1.0
  (1, 5)    0.66
  (2, 2)    1.0
  (2, 3)    0.5

I am looking for a solution that works on a huge matrix without taking much time.

Thanks in advance.

hpaulj
  • So the largest N values for each row? I vaguely recall such an SO question in the past, though it might have been some years ago. It probably involves iterating on the `A.indptr` to get the values for each row. – hpaulj Oct 06 '20 at 16:19
  • https://stackoverflow.com/questions/49207275/finding-the-top-n-values-in-a-row-of-a-scipy-sparse-matrix – hpaulj Oct 07 '20 at 20:25
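
To illustrate the `indptr` idea from the comment above: in CSR format, the stored values of row i live in `A.data[A.indptr[i]:A.indptr[i+1]]`, with the matching column numbers in `A.indices`. Below is a minimal sketch on the example matrix from the question. The helper name `row_top_n` and the variable names are made up for illustration and do not come from the original post.

```python
import numpy as np
from scipy import sparse

# Example matrix from the question (row 0 is empty, rows 1 and 2 hold values)
rows = np.array([1, 1, 1, 2, 2, 2])
cols = np.array([1, 5, 6, 2, 3, 7])
vals = np.array([0.5, 0.66, 1.0, 1.0, 0.5, 0.33])
A = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 8))

def row_top_n(A, i, n=2):
    """Hypothetical helper: (columns, values) of the n largest stored values in row i."""
    start, end = A.indptr[i], A.indptr[i + 1]   # slice bounds of row i
    data = A.data[start:end]
    idx = A.indices[start:end]
    order = np.argsort(data)[::-1][:n]          # positions of the largest values, largest first
    return idx[order], data[order]

cols_1, vals_1 = row_top_n(A, 1)
cols_2, vals_2 = row_top_n(A, 2)
```

For row 1 this picks columns 6 and 5, and for row 2 columns 2 and 3, which matches the desired output in the question.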

1 Answer

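import numpy as np
from scipy import sparse

top_n = 2
out = []

# arr is the input csr_matrix; iterating over it yields one 1 x n_cols row at a time
for r in arr:
    if r.data.size <= top_n:
        # The row already holds at most top_n stored values, so keep it unchanged
        out.append(r)
    else:
        # Positions (within r.data) of the top_n largest stored values
        top_hits = np.argsort(r.data)[-top_n:]
        out.append(sparse.csr_matrix(
            (r.data[top_hits], r.indices[top_hits], np.array([0, len(top_hits)])),
            shape=(1, arr.shape[1])))

# Stack the per-row results back into a single sparse matrix
out = sparse.vstack(out)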

This is just not going to be fast, since it loops over the rows in Python and builds a new one-row matrix for each of them. I don't know of any better way to do it, though.
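
One possible variation, not the method from the answer above: walk `indptr` directly, collect the surviving values and column indices, and build the result matrix once at the end instead of stacking one small matrix per row. The sketch below also swaps in `np.argpartition` for the full sort. The function name `keep_top_n_per_row` is hypothetical, the code assumes `n >= 1` and at least one row, and whether it is actually faster on a matrix of the size in the question has not been measured here.

```python
import numpy as np
from scipy import sparse

def keep_top_n_per_row(A, n=2):
    """Sketch: return a CSR matrix keeping only the n largest stored values per row.

    Assumes n >= 1 and that A has at least one row.
    """
    A = sparse.csr_matrix(A)
    new_data, new_indices = [], []
    new_indptr = np.zeros(A.shape[0] + 1, dtype=A.indptr.dtype)
    for i in range(A.shape[0]):
        start, end = A.indptr[i], A.indptr[i + 1]
        data = A.data[start:end]
        idx = A.indices[start:end]
        if data.size > n:
            # argpartition locates the n largest values without a full sort
            keep = np.argpartition(data, -n)[-n:]
            keep.sort()  # keep the entries in their original stored order
            data, idx = data[keep], idx[keep]
        new_data.append(data)
        new_indices.append(idx)
        new_indptr[i + 1] = new_indptr[i] + data.size
    return sparse.csr_matrix(
        (np.concatenate(new_data), np.concatenate(new_indices), new_indptr),
        shape=A.shape)

# Example matrix from the question
rows = np.array([1, 1, 1, 2, 2, 2])
cols = np.array([1, 5, 6, 2, 3, 7])
vals = np.array([0.5, 0.66, 1.0, 1.0, 0.5, 0.33])
A = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 8))
B = keep_top_n_per_row(A, n=2)
```

On the question's example, `B` keeps (1, 5), (1, 6), (2, 2) and (2, 3). Note that ties between equal values are broken arbitrarily by `np.argpartition`.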

CJR