Here is an example of filtering rows from a pandas DataFrame, first dense, then sparse.
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({'thing': [1, 1, 2, 2, 2],
                   'score': [0.12, 0.13, 0.14, 0.15, 0.17]})

# Boolean mask for the rows where thing == 1
row_index = df['thing'] == 1
print(type(row_index), row_index)
print(df[row_index])   # dense filtering works as expected

sdf = csr_matrix(df)
print(sdf[row_index])  # raises on scipy 1.3.2 (traceback below)
The second print returns only the first two rows, as expected. The third print raises an error (full output below).
How do I fix this code to properly filter the rows of a csr_matrix by row_index, without converting it to a dense matrix? In my real case, the matrix comes from a TF-IDF vectorizer, so it has thousands of columns and I can't afford to densify it.
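One workaround I'm considering, in case the problem is that scipy doesn't accept a pandas Series as a boolean mask: convert the mask to integer row positions first. This is just a sketch; np.flatnonzero is my choice here, and I haven't verified it's the recommended pattern:

import numpy as np

rows = np.flatnonzero(row_index.values)  # boolean Series -> array([0, 1])
print(sdf[rows])                         # integer fancy indexing, stays sparse

But I'd still like to know the proper way to do this.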
I've found some related questions, but I can't tell if the answer is there or not.
I'm using pandas 0.25.3 and scipy 1.3.2.
Full output of code above:
<class 'pandas.core.series.Series'> 0     True
1     True
2    False
3    False
4    False
Name: thing, dtype: bool
   thing  score
0      1   0.12
1      1   0.13
Traceback (most recent call last):
  File "./foo.py", line 13, in <module>
    print(sdf[row_index])
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/_index.py", line 59, in __getitem__
    return self._get_arrayXslice(row, col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/csr.py", line 325, in _get_arrayXslice
    return self._major_index_fancy(row)._get_submatrix(minor=col)
  File "root/.venv/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 690, in _major_index_fancy
    np.cumsum(row_nnz[idx], out=res_indptr[1:])
  File "<__array_function__ internals>", line 6, in cumsum
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2423, in cumsum
    return _wrapfunc(a, 'cumsum', axis=axis, dtype=dtype, out=out)
  File "root/.venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
    return bound(*args, **kwds)
ValueError: provided out is the wrong size for the reduction
EDIT: This behavior depends on the scipy version. I've submitted this as an issue to scipy.
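Since this seems version-dependent, the variant I plan to test across versions is passing a plain boolean ndarray instead of the Series; that the Series-vs-ndarray distinction is the trigger is my assumption, not something I've confirmed in the scipy docs:

import scipy
print(scipy.__version__)     # the failing run above was on 1.3.2

mask = row_index.to_numpy()  # plain boolean ndarray, not a pandas Series
print(sdf[mask])             # may succeed where sdf[row_index] fails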