
I've recently been working on a project where the majority of my time is spent multiplying a dense matrix `A` by a sparse vector `v` (see here). While trying to reduce the computation, I've noticed that the runtime of `A.dot(v)` is not affected by the number of zero entries in `v`.

To explain why I would expect the runtime to improve in this case, let `result = A.dot(v)`, so that `result[i] = sum_j(A[i,j]*v[j])` for `i = 1...A.shape[0]`. If `v[j] = 0`, then the term `A[i,j]*v[j]` contributes nothing to any `result[i]`, no matter the values in the column `A[:,j]`. I would therefore expect numpy to simply skip that column, but it seems as if it goes ahead and computes the full sum anyway.
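As a quick check of that reasoning (a small snippet of my own, separate from the timing script below), dropping the zero columns should leave the product unchanged:

import numpy as np

A = np.random.rand(5, 4)
v = np.array([0.3, 0.0, 0.0, 1.2])   # two of the four entries are zero

nz = v != 0.0                         # columns that actually contribute
full = A.dot(v)                       # what numpy computes
reduced = A[:, nz].dot(v[nz])         # skip the columns where v is zero

print(np.allclose(full, reduced))     # True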

I went ahead and wrote a short sample script to confirm this behavior below.

import time
import numpy as np

np.__config__.show() #make sure BLAS/LAPACK is being used
np.random.seed(seed = 0)
n_rows, n_cols = int(1e5), int(1e3)

#initialize matrix and vector
A = np.random.rand(n_rows, n_cols)
u = np.random.rand(n_cols)
u = np.require(u, dtype=A.dtype, requirements = ['C'])

#time
start_time = time.time()
A.dot(u)
print "time with %d non-zero entries: %1.5f seconds" % (sum(u==0.0), (time.time() - start_time))

#set all but one entry of u to zero
v = u.copy()  # copy so that u itself is left untouched
set_to_zero = np.random.choice(np.arange(u.shape[0]), size=u.shape[0] - 1, replace=False)
v[set_to_zero] = 0.0

start_time = time.time()
A.dot(v)
print "time with %d non-zero entries: %1.5f seconds" % (sum(v==0.0), (time.time() - start_time))


#what I would really expect it to take
non_zero_index = v != 0.0
A_effective = A[:, non_zero_index]
v_effective = v[non_zero_index]


start_time = time.time()
A_effective.dot(v_effective)
print "expected time with %d non-zero entries: %1.5f seconds" % (sum(v==0.0), (time.time() - start_time))

Running this, I get that the runtime for matrix-vector multiplication is the same whether I use the dense vector u or the sparse one v:

time with 0 zero entries: 0.04279 seconds
time with 999 zero entries: 0.04050 seconds
expected time with 999 zero entries: 0.00466 seconds

I am wondering whether this is by design, or whether I am missing something in the way that I'm running the matrix-vector multiplication. As sanity checks, I've made sure that numpy is linked to a BLAS library on my machine and that both arrays are C_CONTIGUOUS (since this is apparently required for numpy to call BLAS).
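For completeness, the checks themselves look roughly like this (the config output will of course vary by machine):

import numpy as np

np.__config__.show()                # lists the BLAS/LAPACK libraries numpy was built against

A = np.random.rand(1000, 100)
u = np.require(np.random.rand(100), dtype=A.dtype, requirements=['C'])

print(A.flags['C_CONTIGUOUS'])      # True
print(u.flags['C_CONTIGUOUS'])      # True
print(A.dtype, u.dtype)             # both float64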

Berk U.
  • Isn't this a question for the BLAS developers? Maybe it's simpler, even faster, to just ignore special cases. Testing and branching can be as expensive as simple multiplications. – hpaulj Feb 09 '16 at 01:55
  • @hpaulj It could be. I'm not sure if it's actually supposed to be built-in and I'm just not getting it to work, or whether they don't do it on purpose. The answer below helps but I think I'd like confirmation in the documentation. If it isn't built in, I'll probably have to implement it myself at this point :-/ – Berk U. Feb 09 '16 at 02:23

2 Answers


How about experimenting with a simple function like this?

def dot2(A, v):
    # multiply using only the columns where v is nonzero
    ind = np.where(v)[0]
    return np.dot(A[:, ind], v[ind])

In [352]: A=np.ones((100,100))

In [360]: timeit v=np.zeros((100,));v[::60]=1;dot2(A,v)
10000 loops, best of 3: 35.4 us per loop

In [362]: timeit v=np.zeros((100,));v[::40]=1;dot2(A,v)
10000 loops, best of 3: 40.1 us per loop

In [364]: timeit v=np.zeros((100,));v[::20]=1;dot2(A,v)
10000 loops, best of 3: 46.5 us per loop

In [365]: timeit v=np.zeros((100,));v[::60]=1;np.dot(A,v)
10000 loops, best of 3: 29.2 us per loop

In [366]: timeit v=np.zeros((100,));v[::20]=1;np.dot(A,v)
10000 loops, best of 3: 28.7 us per loop

A fully iterative Python implementation would be:

def dotit(A, v, test=False):
    n, m = A.shape
    res = np.zeros(n)
    if test:
        # skip the multiply-add whenever v[j] is zero
        for i in range(n):
            for j in range(m):
                if v[j]:
                    res[i] += A[i, j]*v[j]
    else:
        # plain dense loops: always multiply and accumulate
        for i in range(n):
            for j in range(m):
                res[i] += A[i, j]*v[j]
    return res

Obviously this won't be as fast as the compiled dot, but I expect the relative advantages of testing still apply. For further testing you could implement it in Cython.

Notice that the v[j] test occurs deep in the iteration; a column-wise variant that hoists it out of the inner loop is sketched after the timings below.

For a sparse v (3 out of 100 elements) testing saves time:

In [374]: timeit dotit(A,v,True)
100 loops, best of 3: 3.81 ms per loop

In [375]: timeit dotit(A,v,False)
10 loops, best of 3: 21.1 ms per loop

but it costs time if v is dense:

In [376]: timeit dotit(A,np.arange(100),False)
10 loops, best of 3: 22.7 ms per loop

In [377]: timeit dotit(A,np.arange(100),True)
10 loops, best of 3: 25.6 ms per loop
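
One way to hoist that v[j] test out of the inner loop entirely (a sketch, not timed here) is to loop only over the nonzero columns and accumulate whole columns at once:

def dot_cols(A, v):
    # visit only the columns where v is nonzero;
    # each one updates all of res in a single vectorized step
    res = np.zeros(A.shape[0])
    for j in np.flatnonzero(v):
        res += A[:, j] * v[j]
    return res

With only a few nonzero entries this does very few Python-level iterations, each of them a vectorized numpy operation, so I would expect it to land somewhere between dot2 and the pure loops above.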
hpaulj
  • Thank you for this write-up -- it helped me see the trade-off here when the `v[j]` test occurs deep in the iteration. As a follow-up, is there a reason why you would not use `for j in range(m)` in the outer loop? The loop I am thinking of looks something like: `nnz_cols = np.flatnonzero(v); for j in nnz_cols: val = 0.0; for i in range(n): val += A[i,j]*v[j]; res[i] = val`? – Berk U. Feb 10 '16 at 14:43
  • Feel free to experiment. Fewer `if` tests could shift the relative time balances a bit. – hpaulj Feb 10 '16 at 18:44

For plain arrays, NumPy doesn't perform such optimizations, but if you need them, you can use sparse matrices, which may improve dot-product timings in that case. For more on the topic, see: http://docs.scipy.org/doc/scipy/reference/sparse.html
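For example, one possible route (a sketch, assuming scipy is installed, and only worthwhile if v is genuinely sparse) is to store v as a one-row sparse matrix and multiply against A.T:

import numpy as np
from scipy import sparse

A = np.random.rand(1000, 100)
v = np.zeros(100)
v[::20] = 1.0                                 # mostly-zero vector

v_sp = sparse.csr_matrix(v)                   # 1 x 100 sparse row vector
result = np.asarray(v_sp.dot(A.T)).ravel()    # same values as A.dot(v), but only the stored nonzeros are visited

print(np.allclose(result, A.dot(v)))          # True

Whether this actually beats the plain np.dot depends on how sparse v is and on the conversion overhead (see the comment below).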

thodnev
  • But both `A` and `v` have to be quite sparse to be faster than `np.dot`. The `sparse` multiplication is tailored to a particular data representation, and is based on a mathematics paper from several decades ago. – hpaulj Feb 09 '16 at 02:15