
I need to perform online training of a TF-IDF model. I found that scikit-learn's TfidfVectorizer does not support training in an online fashion, so I'm implementing my own CountVectorizer that supports online training and then using scikit-learn's TfidfTransformer to update the tf-idf values after a pre-defined number of documents have entered the corpus.
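To make the setup concrete, here is a rough sketch of what I have in mind (UPDATE_EVERY and the `counts` matrix are placeholders for my own bookkeeping; only TfidfTransformer is real scikit-learn API):

from sklearn.feature_extraction.text import TfidfTransformer

UPDATE_EVERY = 1000            # arbitrary: refit tf-idf every 1000 new documents
transformer = TfidfTransformer()

def maybe_refit(counts, docs_since_last_fit):
    # `counts` is the (n_documents x vocabulary_size) term-count matrix
    # that my own online CountVectorizer keeps growing.
    if docs_since_last_fit >= UPDATE_EVERY:
        return transformer.fit_transform(counts)   # recompute idf over the whole corpus
    return None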

I found here that you shouldn't add rows or columns to numpy arrays, since all the data would need to be copied so that it is stored in contiguous blocks of memory.

But then I also found that, with a scipy sparse matrix, you can in fact manually change the matrix's shape.

The numpy reshape docs say:

It is not always possible to change the shape of an array without copying the data. If you want an error to be raised when the data is copied, you should assign the new shape to the shape attribute of the array

Since the "reshaping" of the sparse matrix is being done by assigning a new shape, is it safe to say data is not being copied? What are the implications of doing so? Is it efficient?

Code example:

from scipy import sparse

matrix = sparse.random(5, 5, .2, 'csr')  # Create a (5, 5) sparse CSR matrix
matrix._shape = (6, 6)                   # Change shape to (6, 6)
# Modify data on the new empty row

I would also like to expand my question to ask about methods such as vstack that allow one to append arrays to one another (the same as adding a row). Does vstack copy the whole data so it gets stored as contiguous blocks of memory, as stated in my first link? What about hstack?


EDIT: So, following this question I've implemented a method to alter the values of a row in a sparse matrix.

Now, mixing the idea of adding new empty rows with the idea of modifying existing values I've come up with the following:

import numpy as np
from scipy import sparse

matrix = sparse.random(5, 3, .2, 'csr')
matrix._shape = (6, 3)
# Update indptr to let it know we added a row with nothing in it.
matrix.indptr = np.hstack((matrix.indptr, matrix.indptr[-1]))

# New elements on data, indices format
new_elements = [1, 1]
elements_indices = [0, 2] 

# Set elements for new empty row
set_row_csr_unbounded(matrix, 5, new_elements, elements_indices)

I ran the above code a few times during the same execution and got no error. But as soon as I try to add a new column (in which case there would be no need to change indptr), I get an error when I try to alter the values. Any lead on why this happens?

Well, since set_row_csr_unbounded uses numpy.r_ underneath, I assume I'm better off using a lil_matrix, even if the elements, once added, cannot be modified. Am I right?

I think that lil_matrix would be better because I assume numpy.r_ is copying the data.

leoschet

1 Answer


In numpy, reshape means to change the shape in such a way that keeps the same number of elements. So the product of the shape terms can't change.

The simplest example is something like

np.arange(12).reshape(3,4)

The assignment method is:

x = np.arange(12)
x.shape = (3,4)

The method (or np.reshape(...)) returns a new array. The shape assignment works in-place.

The docs note that you quoted comes into play when doing something like

x = np.arange(12).reshape(3,4).T
x.reshape(3,4)   # ok, but copy
x.shape = (3,4)  # raises error

To better understand what's happening here, print the array at different stages, and look at how the original 0,1,2,... contiguity changes. (that's left as an exercise for the reader since it isn't central to the bigger question.)

There is a resize function and method, but it isn't used much, and its behavior with respect to views and copies is tricky.

np.concatenate (and variants like np.stack, np.vstack) make new arrays, and copy all the data from the inputs.
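A quick illustrative check (added here, not part of the original answer) using np.shares_memory, which tests whether two arrays overlap in memory; np.r_ goes through the same concatenation machinery:

import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 3))

stacked = np.vstack((a, b))                          # a new (4, 3) array
print(np.shares_memory(stacked, a))                  # False: a's data was copied in
print(np.shares_memory(np.concatenate((a, b)), a))   # False as well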

A list (or an object dtype array) contains pointers to its elements (which may themselves be arrays), and so appending doesn't require copying data.
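For example (an added illustration), appending to a list just stores another reference; nothing already in the list is copied:

import numpy as np

rows = [np.arange(3), np.arange(3, 6)]    # a list of 1d arrays
new_row = np.arange(6, 9)
rows.append(new_row)                      # grows the list of pointers only
print(rows[-1] is new_row)                # True: the array itself was not copied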

Sparse matrices store their data (and row/col indices) in various attributes that differ among the formats. coo, csr and csc have 3 1d arrays. lil has 2 object arrays containing lists. dok is a dictionary subclass.

lil_matrix implements a reshape method. The other formats do not. As with np.reshape the product of the dimensions can't change.
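A small sketch of that (added for illustration; as with numpy, the product of the dimensions stays the same):

from scipy import sparse

L = sparse.random(2, 6, 0.3, 'lil')   # 2x6 matrix in lil format
L2 = L.reshape((3, 4))                # allowed: 2*6 == 3*4
print(L.shape, L2.shape)              # (2, 6) (3, 4)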

In theory a sparse matrix could be 'embedded' in a larger matrix with minimal copying of data, since all the new values will be the default 0, and not occupy any space. But the details for that operation have not been worked out for any of the formats.

sparse.hstack and sparse.vstack (don't use the numpy versions on sparse matrices) work by combining the coo attributes of the inputs (via sparse.bmat). So yes, they make new arrays (data, row, col).
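An added check along the same lines, confirming that the stacked result does not share its data array with the inputs:

import numpy as np
from scipy import sparse

A = sparse.random(3, 4, 0.3, 'csr')
B = sparse.random(2, 4, 0.3, 'csr')

C = sparse.vstack((A, B))                    # 5x4 result built from new attribute arrays
print(C.shape)                               # (5, 4)
print(np.shares_memory(C.data, A.data))      # False: the data was copied into new arrays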

A minimal example of making a larger sparse matrix:

In [110]: M = sparse.random(5,5,.2,'coo')
In [111]: M
Out[111]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [112]: M.A
Out[112]: 
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ]])
In [113]: M1 = sparse.coo_matrix((M.data, (M.row, M.col)),shape=(7,5))
In [114]: M1
Out[114]: 
<7x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [115]: M1.A
Out[115]: 
array([[0.        , 0.80957797, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.23618044, 0.        , 0.91625967, 0.8791744 ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.7928235 , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])
In [116]: id(M1.data)
Out[116]: 139883362735488
In [117]: id(M.data)
Out[117]: 139883362735488

M and M1 have the same data attribute (same array id). But most operations on these matrices will require a conversion to another format (such as csr for math, or lil for changing values), and will involve copying and modifying the attributes. So this connection between the two matrices will be broken.

When you make a sparse matrix with a function like coo_matrix, and don't provide a shape parameter, it deduces the shape from the provided coordinates. If you provide a shape it uses that. That shape has to be at least as large as the implied shape. With lil (and dok) you can profitably create an 'empty' matrix with a large shape, and then set values iteratively. You don't want to do that with csr. And you can't directly set coo values.
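A sketch of that incremental pattern with lil (the shape and values here are just placeholders):

from scipy import sparse

L = sparse.lil_matrix((1000, 50))    # 'empty' matrix with a generous shape; costs almost nothing
L[0, 0] = 1                          # setting individual values is cheap in lil
L[0, 2] = 1
L[1, 5] = 3.0

M = L.tocsr()                        # convert once, when the filling is done
print(M.shape, M.nnz)                # (1000, 50) 3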

The canonical way of creating sparse matrices is to build the data, row, and col arrays or lists iteratively from various pieces - with list append/extend or array concatenation - and make a coo (or csr) format matrix from that. So you do all the 'growing' before even creating the matrix.
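A toy version of that pattern (the per-document dicts below stand in for whatever pieces you are accumulating):

from scipy import sparse

data, rows, cols = [], [], []
# toy input: one {column: count} dict per document
for i, doc_counts in enumerate([{0: 2, 3: 1}, {1: 4}, {0: 1, 2: 5}]):
    for j, v in doc_counts.items():
        rows.append(i)
        cols.append(j)
        data.append(v)

M = sparse.coo_matrix((data, (rows, cols)), shape=(3, 4)).tocsr()
print(M.A)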

changing _shape

Make a matrix:

In [140]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [141]: M
Out[141]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [142]: M.A
Out[142]: 
array([[0, 6, 7],
       [0, 0, 6],
       [1, 0, 5],
       [0, 0, 0],
       [0, 6, 0]])

In [144]: M[1,0] = 10
... SparseEfficiencyWarning)
In [145]: M.A
Out[145]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0]])

Your new shape approach (make sure the dtype of indptr doesn't change):

In [146]: M._shape = (6,3)
In [147]: newptr = np.hstack((M.indptr,M.indptr[-1]))
In [148]: newptr
Out[148]: array([0, 2, 4, 6, 6, 7, 7], dtype=int32)
In [149]: M.indptr = newptr
In [150]: M
Out[150]: 
<6x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [151]: M.A
Out[151]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0,  0]])
In [152]: M[5,2]=10
... SparseEfficiencyWarning)
In [153]: M.A
Out[153]: 
array([[ 0,  6,  7],
       [10,  0,  6],
       [ 1,  0,  5],
       [ 0,  0,  0],
       [ 0,  6,  0],
       [ 0,  0, 10]])

Adding a column also seems to work:

In [154]: M._shape = (6,4)
In [155]: M
Out[155]: 
<6x4 sparse matrix of type '<class 'numpy.int64'>'
    with 8 stored elements in Compressed Sparse Row format>
In [156]: M.A
Out[156]: 
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [ 0,  0, 10,  0]])
In [157]: M[5,0]=10
.... SparseEfficiencyWarning)
In [158]: M[5,3]=10
.... SparseEfficiencyWarning)
In [159]: M
Out[159]: 
<6x4 sparse matrix of type '<class 'numpy.int64'>'
    with 10 stored elements in Compressed Sparse Row format>
In [160]: M.A
Out[160]: 
array([[ 0,  6,  7,  0],
       [10,  0,  6,  0],
       [ 1,  0,  5,  0],
       [ 0,  0,  0,  0],
       [ 0,  6,  0,  0],
       [10,  0, 10, 10]])

attribute sharing

I can make a new matrix from an existing one:

In [108]: M = (sparse.random(5,3,.4,'csr')*10).astype(int)
In [109]: newptr = np.hstack((M.indptr,6))
In [110]: M1 = sparse.csr_matrix((M.data, M.indices, newptr), shape=(6,3))

The data attributes are shared, at least in the view sense:

In [113]: M[0,1]=14
In [114]: M1[0,1]
Out[114]: 14

But if I modify M1 by adding a nonzero value:

In [117]: M1[5,0]=10
...
  SparseEfficiencyWarning)

The link between the matrices breaks:

In [120]: M[0,1]=3
In [121]: M1[0,1]
Out[121]: 14
hpaulj
  • Thanks for your answer, I have a question though: say I create a `csr_matrix` with shape `(3,5)`, for example, then I do `matrix._shape = (3,6)`; since all new elements are 0 it does not occupy any additional space, but as soon as I change one of those 0s to another value, it will occupy more space, correct? If so, will it copy all the data when doing so? – leoschet May 22 '18 at 11:01
  • You will get an efficiency warning. – hpaulj May 22 '18 at 11:09
  • The thing is, it doesn't; I've tried it as shown [here](https://stackoverflow.com/a/17457226/7454638) and got no warning – leoschet May 22 '18 at 11:10
  • Changing the private attribute `_shape` won't give a warning, but adding new values will. – hpaulj May 22 '18 at 13:35
  • I see, I implemented a [method to set rows](https://stackoverflow.com/questions/28427236/set-row-of-csr-matrix/50468983#50468983) on a `csr_matrix`, and if I only add **new rows** by manually changing `_shape` and the `indptr` array I get no errors, but as soon as I add a **new column** and try to add some values, I get an error... – leoschet May 22 '18 at 14:10
  • Another thing, I'm a bit confused by the following: I'm able to add empty rows to a `csr_matrix` by changing `_shape` and `indptr`, and I'm also able to alter the new row's values with the implemented method. Why does altering those values not return an error? And if adding rows works, why does adding columns not? As far as I can see, `A.data`, `A.indices` and `A.indptr` are the same no matter how many 0-columns you add... – leoschet May 22 '18 at 14:16
  • 1
    When you start modifying the matrix attributes directly you need to understand them. Numerous SO answers have suggested fast iteration on rows using `indptr` directly, but that's for calculations that don't change sparsity. When you start changing the sparsity (adding or removing nonzero values) directly you are taking risks. I have not attempted to do that myself, though I probably could debug the code. – hpaulj May 22 '18 at 23:18
  • Indeed, one must understand the attributes. I'm still curious to know why one can't add columns; I mean, a 2x3 and a 2x4 matrix will have the same values in their attributes, since the bigger one differs from the smaller only by a new column of 0s. Thanks for your time, I'm inclined to accept your answer since it explains most of the underlying details. – leoschet May 23 '18 at 16:15