
Yes, I've read this and this answer, but I still can't wrap my head around it... it's a basic question.

In:

M[:, index]
M[index, :]

Which one is row slicing, and which one is column slicing?

And to my problem, if I want to do advanced indexing for the columns like:

M[:, indexes]  # indexes is an array like [0, 4, 9]

Which sparse matrix type is most efficient for doing M[:, indexes]: CSR or CSC?
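
For a concrete toy example of what I mean (just a tiny matrix to illustrate, nothing special about the values):

import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.arange(12).reshape(3, 4))
indexes = np.array([0, 3])

print(M[1, :].toarray())        # row 1, shape (1, 4)
print(M[:, 1].toarray())        # column 1, shape (3, 1)
print(M[:, indexes].toarray())  # columns 0 and 3, shape (3, 2)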

juanmirocks
  • Thanks for the downvote, I know it's a dummy question. Please enlighten me. – juanmirocks Feb 02 '17 at 09:13
  • The first one is column slicing because it selects a column for all rows; the second is row slicing. – Vivek Kumar Feb 02 '17 at 09:18
  • Be careful here. Sparse matrix slicing and indexing is different from `ndarray`. It looks the same, but it is actually implemented with matrix multiplication. So distinctions between basic v advanced indexing and copy v view don't apply. If you are worried about speed, do your own timings. – hpaulj Feb 02 '17 at 18:05
  • Where did you find `_row_slicing_`? I haven't seen such a method. – hpaulj Feb 02 '17 at 18:06
  • @hpaulj I was only giving emphasis -- I removed the underscores as that may be clearer – juanmirocks Feb 06 '17 at 09:35

3 Answers


Actually neither is row/column slicing: those are examples of row/column indexing instead.

  • M[index, :] is row indexing
  • M[:, index] is column indexing
  • M[start:stop, :] is row slicing
  • M[:, start:stop] is column slicing
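
A quick sketch of the distinction (just to see the resulting shapes, with a random scipy.sparse matrix):

from scipy import sparse

M = sparse.rand(10, 10, 0.5, 'csc')

print(M[3, :].shape)          # row indexing:    (1, 10)
print(M[:, 3].shape)          # column indexing: (10, 1)
print(M[2:5, :].shape)        # row slicing:     (3, 10)
print(M[:, 2:5].shape)        # column slicing:  (10, 3)
print(M[:, [0, 4, 9]].shape)  # column indexing with an array: (10, 3)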

CSC is more efficient at retrieving entire columns: the non-zero values of a specific column and the matching row indices are internally stored as contiguous arrays in memory.

The dual is true for CSR and the retrieval of an entire row.
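
To make that concrete, a small sketch of the CSC internals (data, indices and indptr are attributes of a scipy csc_matrix):

import numpy as np
from scipy import sparse

M = sparse.csc_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))

j = 2                                       # look at column 2
start, stop = M.indptr[j], M.indptr[j + 1]
print(M.data[start:stop])     # its non-zero values, contiguous: [2 5]
print(M.indices[start:stop])  # the matching row indices:        [0 2]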

ogrisel
    This makes much more sense to me. I was thinking that `:` was the slice, so that's why I was confused with `M[:, index]` being column slicing. So I guess you are saying, `:` it's not truly slicing as it just means "all", and then I'm just doing indexing of columns. -- Further, for my case, I'm not truly slicing columns only indexing (as you said). `CSC` is for my case better not exactly because this type is _better at slicing_ but more generally, because it's _better at retrieving whole columns_ -- Correct? – juanmirocks Feb 02 '17 at 10:46
  • Yes, `:` alone is a shorthand for `None:None` and means the complete slice along that axis, which is equivalent to no slicing at all ;) Yes, CSC is better at retrieving whole columns (both by slicing a subset of the columns and by indexing specific columns). – ogrisel Feb 02 '17 at 13:32
  • Thanks, that clarifies the confusion I had with `:` and therefore answers my question :-) -- _to slice or not to slice_ – juanmirocks Feb 02 '17 at 16:55

While row selection is faster than column selection for csr, the difference is not big:

In [288]: Mbig=sparse.rand(1000,1000,.1, 'csr')
In [289]: Mbig[:1000:50,:]
Out[289]: 
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 2066 stored elements in Compressed Sparse Row format>
In [290]: timeit Mbig[:1000:50,:]
1000 loops, best of 3: 1.53 ms per loop
In [291]: timeit Mbig[:,:1000:50]
100 loops, best of 3: 2.04 ms per loop

In [292]: Mbig=sparse.rand(1000,1000,.1, 'csc')
In [293]: timeit Mbig[:1000:50,:]
100 loops, best of 3: 2.16 ms per loop
In [294]: timeit Mbig[:,:1000:50]
1000 loops, best of 3: 1.65 ms per loop

It isn't worth switching formats:

In [295]: timeit Mbig.tocsr()[:1000:50,:]
...
100 loops, best of 3: 2.46 ms per loop

Contrast this with the same slice for the dense version:

In [297]: A=Mbig.A
In [298]: timeit A[:,:1000:50]
...
1000000 loops, best of 3: 557 ns per loop
In [301]: timeit A[:,:1000:50].copy()
...
10000 loops, best of 3: 52.5 µs per loop

To complicate the comparison, indexing with an array (NumPy advanced indexing) is actually faster than with the slice:

In [308]: idx=np.r_[0:1000:50]    # expand slice into array
In [309]: timeit Mbig[idx,:]
1000 loops, best of 3: 1.49 ms per loop
In [310]: timeit Mbig[:,idx]
1000 loops, best of 3: 513 µs per loop

Here column indexing of the csc gets the bigger speed improvement.

And for a single row or column, csr and csc have getrow/getcol methods:

In [314]: timeit Mbig.getrow(500)
1000 loops, best of 3: 434 µs per loop
In [315]: timeit Mbig.getcol(500)        # 1 column from csc is fastest
10000 loops, best of 3: 78.7 µs per loop
In [316]: timeit Mbig[500,:]
1000 loops, best of 3: 505 µs per loop
In [317]: timeit Mbig[:,500]
1000 loops, best of 3: 264 µs per loop

In https://stackoverflow.com/a/39500986/901925 I recreated the extractor code that sparse uses to fetch rows or columns. It constructs a new sparse 'vector' of 1s and 0s, and uses matrix multiplication to 'select' rows or columns.
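
A rough sketch of that idea (a simplified stand-in of mine, not the actual scipy code):

import numpy as np
from scipy import sparse

def extract_cols(M, idx):
    # one 1 per wanted column, then a matrix product does the "selection"
    idx = np.asarray(idx)
    extractor = sparse.csc_matrix(
        (np.ones(len(idx)), (idx, np.arange(len(idx)))),
        shape=(M.shape[1], len(idx)))
    return M * extractor                       # same result as M[:, idx]

Mbig = sparse.rand(1000, 1000, .1, 'csc')
idx = np.r_[0:1000:50]
print((extract_cols(Mbig, idx) != Mbig[:, idx]).nnz)   # 0 -> identical results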

hpaulj
  • After testing both csr and csc, I also didn't see a noticeable difference between them for column slicing (with matrices of size ~(1k, 15k)) -- therefore I stuck with csr, as that was the original type of my matrix. – juanmirocks Feb 02 '17 at 19:04

It's OK to get the order mixed up every once in a while; my trick is to just picture a matrix and remember that the index order counts from top to bottom and then from left to right: https://en.wikipedia.org/wiki/Index_notation

down, then left to right

So, since : means all, you know that [:, i] means all rows of column i, and [i, :] means all columns of row i.

To the second part of your question: you want M[:, indices], so the trick is in the name: if you loop over your columns (which you do, since you specify column indices for all rows), then you want the compressed sparse column format. It says so in the docs you linked:

Advantages of the CSC format

  • ...
  • efficient column slicing
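
A minimal sketch of that, assuming the matrix starts out in some other format and is converted once:

import numpy as np
from scipy import sparse

M = sparse.rand(1000, 1000, 0.01, 'csr')   # whatever format the matrix is in
indexes = np.array([0, 4, 9])

cols = M.tocsc()[:, indexes]   # convert to CSC once, then take the columns
print(cols.shape)              # (1000, 3)
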
instant