I have a sparse matrix (22000x97482) in csr format and i want to delete some columns (indices of columns numbers are stored in a list)
Asked
Active
Viewed 5,653 times
7
-
Does this answer your question? http://stackoverflow.com/questions/2368544/how-can-i-remove-a-column-from-a-sparse-matrix-efficiently – Gary Walker May 31 '14 at 06:11
-
Not really, it is another format of sparse matrix. i'd tried but no result – Asqan May 31 '14 at 06:14
-
I think you need to do column based slices then (bound to be slow in CSR format). If you do this a lot, CSR is probably not a good choice choice, maybe use CSC instead, e.g. http://stackoverflow.com/questions/13352280/slicing-sparse-matrices-in-scipy-which-types-work-best – Gary Walker May 31 '14 at 06:37
-
From the documentation: `Disadvantages of the CSR format slow column slicing operations (consider CSC)`. It's common practice in `sparse` to convert matricies to the optimal type before acting. – hpaulj May 31 '14 at 07:16
-
1In some quick tests, `X[I,:]` is about 10x faster than `X[:,I]` for a `csr` matrix. `X.tocsc[:,I]` is a bit faster than `X[:,I]`. So if you are doing a lot of column slicing it is worth the extra step of converting to a `csc` format. – hpaulj May 31 '14 at 19:22
2 Answers
9
If you have a very large number of columns then generating the full set of column indices can become rather costly. One slightly faster alternative would be to temporarily convert to COO format:
import numpy as np
from scipy import sparse
def dropcols_fancy(M, idx_to_drop):
idx_to_drop = np.unique(idx_to_drop)
keep = ~np.in1d(np.arange(M.shape[1]), idx_to_drop, assume_unique=True)
return M[:, np.where(keep)[0]]
def dropcols_coo(M, idx_to_drop):
idx_to_drop = np.unique(idx_to_drop)
C = M.tocoo()
keep = ~np.in1d(C.col, idx_to_drop)
C.data, C.row, C.col = C.data[keep], C.row[keep], C.col[keep]
C.col -= idx_to_drop.searchsorted(C.col) # decrement column indices
C._shape = (C.shape[0], C.shape[1] - len(idx_to_drop))
return C.tocsr()
Check equivalence:
m, n, d = 1000, 2000, 20
M = sparse.rand(m, n, format='csr')
idx_to_drop = np.random.randint(0, n, d)
M_drop1 = dropcols_fancy(M, idx_to_drop)
M_drop2 = dropcols_coo(M, idx_to_drop)
print(np.all(M_drop1.A == M_drop2.A))
# True
Benchmark:
In [1]: m, n = 1000, 1000000
In [2]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_fancy(M, idx_to_drop)
...:
1 loops, best of 3: 1.11 s per loop
In [3]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_coo(M, idx_to_drop)
...:
1 loops, best of 3: 365 ms per loop

ali_m
- 71,714
- 23
- 223
- 298
-
You could use `searchsorted` instead of the `greater_outer` for a little more speed. Would probably be a good idea to call `unique(idx_to_drop)` as well :-) – Jun 19 '16 at 22:18
-
@morningsun Thanks for fixing my mistake, and for your other suggestions. Using `searchsorted` to decrement the column indices is a major improvement over using broadcasting, especially when `idx_to_drop` is large. I wish I'd thought of it myself! – ali_m Jun 20 '16 at 00:34
7
You can use fancy indexing to obtain a new csr_matrix
with the columns that you have in your list:
all_cols = np.arange(old_m.shape[1])
cols_to_keep = np.where(np.logical_not(np.in1d(all_cols, cols_to_delete)))[0]
m = old_m[:, cols_to_keep]

Saullo G. P. Castro
- 56,802
- 26
- 179
- 234
-
3Explicitly generating all column indices is not really ideal if you have a huge number of columns – ali_m Oct 30 '15 at 12:46
-
@ali_m thank you for the feedback. I couldn't come out with a better approach. Do you have any ideas? – Saullo G. P. Castro Oct 30 '15 at 13:10