1

I am writing a function that pulls the top x values from a sparse vector (fewer values if there are less than x). I would like to include an "in-place" option like many functions have, where it removes the top values if the option is True, and keeps them if the option is False.

My issue is that my current function is overwriting the input vector, rather than keeping it as is. I am not sure why this is occurring. I expected the way to solve my design problem would be to include an if statement which would copy the input using copy.copy(), but this raises a value error (ValueError: row index exceeds matrix dimensions) which does not make sense to me.

Code:

from scipy.sparse import csr_matrix
import copy

max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]

sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))

print(sparse_test)

def top_x_in_sparse(in_vect,top_x,inplace=False):
    if inplace==True:
        rvect=in_vect
    else:
        rvect=copy.copy(in_vect)
    newmax=top_x
    count=0
    out_list=[]
    while newmax>0:
        newmax=1
        if count<top_x:
            out_list+=[csr_matrix.max(rvect)]
            remove=csr_matrix.argmax(rvect)
            rvect[0,remove]=0
            rvect.eliminate_zeros()
            newmax=csr_matrix.max(rvect)
            count+=1
        else:
            newmax=0
    return out_list

a=top_x_in_sparse(sparse_test,3)

print(a)

print(sparse_test)

My question has two parts:

  1. how do I prevent this function from overwriting the vector?
  2. how do I add the inplace option?
amquack
  • 837
  • 10
  • 24
  • A `csr_matrix` is not a `ndarray`, so the usual (for `numpy` users) notions of inplace, view, and copy don't apply. We need read the sparse docs and test the code step by step. If I recall correctly `eliminate_zeros` is one of the few in-place methods (that's clearly stated its docs). – hpaulj Dec 30 '20 at 01:31
  • 1
    I'd use `rvect=in_vect.copy()` to make a copy (though `copy.copy` probably delegates to this method). Beware that sparse copies are not fast. – hpaulj Dec 30 '20 at 01:59
  • Why are you using the `csr_matrix.argmax(rvect)` format instead of `rvect.argmax()`? That is, I use a method where possible. – hpaulj Dec 30 '20 at 02:02
  • `rvect[0,remove]=0` is another in-place action. If the shape is always `(1,max_loc)`, i.e. 1 row, I suspect working with dense `ndarray` will be faster (and still fit in memory unless `max_loc` is extremely large). – hpaulj Dec 30 '20 at 02:21
  • @hpaulj There's some good information here that I'd like to unpack. "Beware that sparse copies are not fast" - why is this? It seems that they should behave the same as a sparse, or perhaps I misunderstand what copy does? "I use a method where possible" - I have not heard this guidance before - what's the rationale behind this? – amquack Dec 30 '20 at 15:42
  • As for the size of the arrays, I just pulled up a file and the vector dimensions would be 1 by 1.6milion, and there would be several thousand of these vectors. – amquack Dec 30 '20 at 15:45
  • 3) "usual notions of inplace, view, and copy don't apply" - I thought views/copy were a python thing, can you elaborate? (Ex: a=[3] // x=a // a[0]=2 // print(a,x) shows that x is a view of a as both changed.) – amquack Dec 30 '20 at 15:52
  • @hpaulj I tried your suggestion to use in_vect.copy() instead, and now the inplace functionality is WAI. Additionally, I do not need to import copy for this. You can post this as an answer and I'll accept. I'd appreciate your thoughts on some of the questions I asked in the comments as well. – amquack Dec 30 '20 at 16:04
  • As written you will reallocate and copy the backing data arrays every iteration of your loop. I would consider not doing that. – CJR Dec 31 '20 at 06:54
  • @CJR Is this a result of including the `rvect.eliminate_zeros()` line? I considered not including this (or just including it at the end), as it probably doesn't really matter that I do this at every iteration. – amquack Dec 31 '20 at 15:48

1 Answers1

1

You really just don't want to loop period. Reallocating those arrays every loop iteration with .eliminate_zeros() is the slowest thing but not the only reason not do to it.

import numpy as np
from scipy.sparse import csr_matrix

max_loc=20
data=[1,3,3,2,5]
rows=[0]*len(data)
indices=[4,2,8,12,7]

sparse_test=csr_matrix((data, (rows,indices)), shape=(1,max_loc))

Something like this would be better:

def top_x_in_sparse(in_vect,top_x,inplace=False):
    
    n = len(in_vect.data)
    
    if top_x >= n:
        if inplace:
            out_data = in_vect.data.tolist()
            in_vect.data = np.array([], dtype=in_vect.data.dtype)
            in_vect.indices = np.array([], dtype=in_vect.indices.dtype)
            in_vect.indptr = np.array([0, 0], dtype=in_vect.indptr.dtype)
            return out_data
        else:
            return in_vect.data.tolist()
        
    else:
        k = n - top_x
        partition_idx = np.argpartition(in_vect.data, k)

        if inplace:
            out_data = in_vect.data[partition_idx[k:n]].tolist()
            in_vect.data = in_vect.data[partition_idx[0:k]]
            in_vect.indices = in_vect.indices[partition_idx[0:k]]
            in_vect.indptr = np.array([0, len(in_vect.data)], dtype=in_vect.indptr.dtype)            
            return out_data
        else:
            return in_vect.data[partition_idx[k:n]].tolist()
    

If you need the returned values sorted you can do that as well of course.

a=top_x_in_sparse(sparse_test,3,inplace=False)

>>> print(a)
[3, 5, 3]

>>> print(sparse_test)
  (0, 2)    3
  (0, 4)    1
  (0, 7)    5
  (0, 8)    3
  (0, 12)   2

b=top_x_in_sparse(sparse_test,3,inplace=True)

>>> print(b)
[3, 5, 3]

>>> print(sparse_test)
  (0, 4)    1
  (0, 12)   2

Also as per your question from the comments: a shallow copy of the sparse array object will not copy the numpy arrays that hold the data. The sparse object only has a reference to those objects. A deep copy would get them, but using the built-in method for copy already takes care of knowing which things that are referenced need to be copied and which dont.

CJR
  • 3,916
  • 2
  • 10
  • 23
  • My first thought was that this simply converted to np arrays and did not preserve sparsity (simply converting to numpy array), but I think now that the key is your use of `.data`. I looked at scipy.sparse to see if I could find documentation, but was unsuccessful. `.indices` appears to do the same thing with the location in the array. Where can I look to find out more information about these types of operations? The idea here appears to be to pull out the non-sparse elements of the array, and then use numpy functions to conduct operations on this now dense vector, which is very nice. – amquack Dec 31 '20 at 19:29
  • 1
    The sparse matrices are just some memory-contiguous numpy arrays and the code that lets them get interpreted. `.data`, `.indices` and `.indptr` are the references to those arrays. Different sparse types will have different backing data structures. See here for a CSR example: https://stackoverflow.com/questions/52299420/scipy-csr-matrix-understand-indptr – CJR Dec 31 '20 at 20:48