
I saw a post on Stack Overflow where someone showed that the CSR representation of a vector/matrix was slower than the usual dense matrix/vector format for various NumPy computations. The speed seems to depend on the computation and on how sparse the vectors or matrices are.

I have lots of sparse vectors (on average, 66% of the entries are 0) whose dot products I would like to compute. Note that every element in my vectors is either 0 or 1. Which representation is better for this (e.g. CSR, a normal dense vector, etc.) in terms of computational speed? Does it depend on how sparse my vectors are? If so, is there a certain sparsity (%) beyond which one representation is better than the other?
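
Here is a minimal sketch of the kind of comparison I mean; the vector length and density below are just illustrative assumptions rather than my real data, and the timings will of course vary by machine:

```python
import numpy as np
import scipy.sparse as sp
from timeit import timeit

n = 60_000           # vector length (illustrative assumption)
density = 0.33       # fraction of ones (illustrative assumption)

rng = np.random.default_rng(0)
a = (rng.random(n) < density).astype(np.float64)
b = (rng.random(n) < density).astype(np.float64)

# Dense dot product.
t_dense = timeit(lambda: a @ b, number=100)

# CSR version: scipy.sparse only has 2-D matrices, so the vectors become
# (1, n) and (n, 1) sparse matrices and the dot product is a matrix product.
a_csr = sp.csr_matrix(a)
b_col = sp.csr_matrix(b).T
t_sparse = timeit(lambda: (a_csr @ b_col)[0, 0], number=100)

print(f"dense: {t_dense:.4f} s   sparse: {t_sparse:.4f} s")
```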

Any help with this issue is much appreciated! Thanks in advance!

Alex Pharaon
  • Maybe consider converting the sparse vectors to sets, and calculating the size of their intersections? (A sketch of this idea appears after the comment list below.) – Mark Lavin Aug 09 '21 at 16:04
  • If you take the dot product of two sparse vectors, are you assuming that they are nonzero in the same locations? Btw, CSR is a matrix representation, so that's not relevant to you. Historical note: CSR was developed for numerical PDE applications, where typically you can have matrices of size million (or these days billion), but at most a few dozen nonzeros per row. Dense storage would not even be possible, never mind speed difference. However, the vectors are usually dense. – Victor Eijkhout Aug 09 '21 at 16:05
  • Thanks for the suggestion @MarkLavin. I am not sure it'll be faster, as I have access to a GPU, and GPUs are optimised for vector/matrix calculations. However, it is definitely worth trying :) – Alex Pharaon Aug 09 '21 at 16:17
  • Thanks for the clarification @VictorEijkhout - I thought it applied to vectors as well. I guess I mean saving the row numbers where there is a 1 for each column vector. I have 10,000 different vectors and each has 60,000 rows. On average, I would estimate about 5% of the entries are 1s. – Alex Pharaon Aug 09 '21 at 16:19
  • Actually 33% are 1s. – Alex Pharaon Aug 09 '21 at 17:32
  • The linked answer has "vector" examples, if by vector you mean a (1,n) shaped array (or sparse matrix). Looks like I showed where CSR has an advantage - large matrices (relatively square) with sparsity on the order of 1% or less. – hpaulj Aug 09 '21 at 18:01
  • I presume you mean a *density* of 1% or less @hpaulj? – Alex Pharaon Aug 09 '21 at 19:04
  • For vectors, sparseness is not such a useful concept. For matrices it is way more important. In many LP solvers, the LP matrix and the basis factorization are handled using sparse technology while most (or even all) vectors are stored dense. Sometimes some vectors are also stored sparse, but that is less usual. – Erwin Kalvelagen Aug 09 '21 at 19:51
  • I tend to use `sparsity` and `density` to mean the same, the proportion of nonzero values. But checking the docs, I only see `sparsity structure`, referring to where the nonzero values are located. – hpaulj Aug 09 '21 at 20:31
  • I have never heard before that sparsity = density. I have only seen sparsity = 1 - density. If a matrix has a density of 0.01%, it is 99.99% sparse. In other words, a matrix cannot be very sparse and very dense at the same time. – Erwin Kalvelagen Aug 09 '21 at 23:40
  • Also worth remembering that the standard (dense) array operations are run through the highly optimized BLAS libraries and the `scipy.sparse` array operations are all in some single-threaded unoptimized C extensions. – CJR Aug 10 '21 at 20:48
  • Ahhh ok @CJR, I did not know that -- thanks! Seems the standard dense arrays are probably the way to go for my problem then. – Alex Pharaon Aug 10 '21 at 21:45
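
Following up on the set-based suggestion in the comments: since every element is 0 or 1, the dot product of two vectors is just the number of positions where both are 1, i.e. the size of the intersection of their index sets. A minimal sketch, assuming illustrative sizes rather than the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, density = 60_000, 0.33                 # illustrative assumptions
a = (rng.random(n) < density).astype(np.int64)
b = (rng.random(n) < density).astype(np.int64)

# Keep only the indices of the ones; for 0/1 vectors the dot product equals
# the number of indices present in both vectors.
set_a = set(np.flatnonzero(a))
set_b = set(np.flatnonzero(b))

dot_from_sets = len(set_a & set_b)
assert dot_from_sets == int(a @ b)        # matches the dense dot product
print(dot_from_sets)
```

Whether this beats a BLAS-backed dense dot product would still need to be measured on the actual data; the set representation mainly avoids storing the zeros.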

0 Answers