
I'm doing a simple sparse matrix exponentiation, a**16, using SciPy 0.17. (Note: matrix power, not element-wise multiplication.) However, on my machines (running Debian stable and Ubuntu LTS) this is ten times slower than using a for loop or writing something silly like a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a. That doesn't make sense, so I assume I'm doing something wrong, but what?

import scipy.sparse
from time import time

a = scipy.sparse.rand(2049, 2049, 0.002)

print("Trying exponentiation (a**16)")
t = time()
x = a**16
print(repr(x))
print("Exponentiation took %f seconds\n" % (time() - t))

print("Trying expansion (a*a*a*...*a*a)")
t = time()
y = a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a
print(repr(y))
print("Expansion took %f seconds\n" % (time() - t))

print("Trying a for loop (z=z*a)")
t = time()
z = scipy.sparse.eye(2049)
for i in range(16):
    z = z * a
print(repr(z))
print("Looping took %f seconds\n" % (time() - t))

# Sanity check: all approximately the same answer, right?
assert (abs(x - z) >= 1e-9).nnz == 0
assert (abs(x - y) >= 1e-9).nnz == 0
hackerb9
  • Can't reproduce. Exponentiation is faster in my tests. – user2357112 Jun 19 '17 at 04:07
  • (Also, you're printing the `__repr__` method of the results, rather than the `repr`esentation.) – user2357112 Jun 19 '17 at 04:08
  • Well, I guess that's good that it works for someone. Are you using scipy-0.17? – hackerb9 Jun 19 '17 at 04:14
  • (P.S. I'll fix the repr in the question. I noticed I forgot to put abs in the sanity check, too. Neither matters much for my question, but if it bugs people, I may as well fix it.) – hackerb9 Jun 19 '17 at 04:19
  • Looking at `a.__pow__`, it looks like `a**16` is evaluated as `a1=a*a; a1=a1*a1; a1=a1*a1; a1=a1*a1`, i.e. recursive squaring. Times are consistent with that observation, about 3x faster than the simple chain of 16 multiplications. – hpaulj Jun 19 '17 at 04:28
  • @hpaulj: yes, I get that too for small matrices. Unfortunately, the problem I'm working on is not small. One person said that ** was faster for them; could this be a problem with CPU cache size? – hackerb9 Jun 19 '17 at 04:29
  • Can you post your results in the question too? Mine shows it's slower as well: a**16 -> `19.95 secs`, a*a*...*a -> `2.43 secs`, loop -> `2.36 secs` using SciPy 0.17.0 – justhalf Jun 19 '17 at 04:29
  • Might be due to [denormalized numbers](https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x/9314926#9314926)? – justhalf Jun 19 '17 at 04:32
  • I also tried manually doing recursive squaring, and the timing is the same as doing a**16. – justhalf Jun 19 '17 at 04:37
  • @justhalf: my speed results are equivalent to yours. If the issue was denormalized numbers, wouldn't ** still be slower for a small matrix (as hpaulj mentioned.... oops, looks like he edited that comment). On his machine and mine, exponentiation is actually 3x faster when using a small matrix (100x100). – hackerb9 Jun 19 '17 at 04:46
  • @user2357112 At least one other person has been able to replicate the problem. I'm curious if this is an artifact of the CPU cache size. Can you please try it again with a larger matrix (perhaps 20000x20000) and see if the issue occurs on your machine? – hackerb9 Jun 19 '17 at 04:58
  • 3
    With your large matrix size the `a*,...` is 10x faster. Also with this matrix the result has 400x more nonzero terms than it started with. The result is dense. On the hand with the small matrix, the number of nonzero terms decreases. – hpaulj Jun 19 '17 at 05:15
  • In my 100x100 case, density 0.01, some rows or columns were all 0, so the density decreased with each multiplication. But the larger matrix, even with a lower density, didn't have any all-0 rows or columns, and density increased. – hpaulj Jun 19 '17 at 05:37
  • I had to reduce the size somewhat to fit Ideone's timing restrictions. With the original matrix size, exponentiation is slower. It looks like it's a matter of denser matrices being multiplied in exponentiation by squaring. – user2357112 Jun 19 '17 at 05:40
  • This is a great question. It shows that the dense-vs-sparse distinction has a large impact in SciPy's implementation. – justhalf Jun 19 '17 at 05:44
  • Is there any difference between `a**16.1` and `a**16`? If there's none, then one can assume some generic matrix exp algorithm is used. (Such as eigen decomposition, which is very slow) – Kh40tiK Jun 19 '17 at 07:45
  • @Kh40tiK. SciPy gives me an error when I try to raise a matrix to a non-integer power: `ValueError: exponent must be an integer`. – hackerb9 Jun 19 '17 at 08:31

1 Answer


@hpaulj's comment about the number of nonzeros is important. As you compute higher powers of a, the number of nonzero elements increases. For sparse matrices, the time to compute a matrix product increases with the number of nonzero elements.

The algorithm used to compute a**16 is, in effect:

a2 = a*a
a4 = a2*a2
a8 = a4*a4
a16 = a8*a8

Now look at the number of nonzero elements in those matrices for a = sparse.rand(2049, 2049, 0.002):

matrix      nnz    fraction nnz
  a        8396       0.0020
  a2      34325       0.0082
  a4     521593       0.1240
  a8    4029741       0.9598

In the last product, a16 = a8*a8, the factors are 96% nonzero. Computing that product using sparse matrix multiplication is slow. That last step takes up 97% of the time to compute a**16.
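A minimal sketch to reproduce those counts and time each squaring step (the exact numbers vary with the random seed; CSR format is requested here only so every product stays in one format):

import scipy.sparse
from time import time

a = scipy.sparse.rand(2049, 2049, 0.002, format='csr')

m = a
for name in ('a2', 'a4', 'a8', 'a16'):
    t = time()
    m = m * m                      # square to get the next power
    elapsed = time() - t
    density = m.nnz / float(m.shape[0] * m.shape[1])
    print("%-4s nnz=%8d density=%.4f took %.2f s" % (name, m.nnz, density, elapsed))

The last line of output is the a16 = a8*a8 product, and its time should dwarf the earlier steps.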

On the other hand, when you compute a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a, a sparse matrix multiplication is performed 15 times, but one of the factors in each product always has a small fraction (0.002) of nonzero values, so each product can be performed reasonably efficiently.

This suggests that there is probably an optimal strategy for computing the product, balancing the number of multiplications against the sparsity of the factors. For example, computing a2 = a*a; a16 = a2*a2*a2*a2*a2*a2*a2*a2 is faster than a16 = a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a:

In [232]: %timeit a2 = a*a; a4 = a2*a2; a8 = a4*a4; a16 = a8*a8
14.4 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [233]: %timeit a16 = a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a
1.77 s ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [234]: %timeit a2 = a*a; a16 = a2*a2*a2*a2*a2*a2*a2*a2
1.42 s ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
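
That middle ground can be generalized. Here is a hypothetical helper (`sparse_power_chain` is my name, not a SciPy function) that squares a few times and then chains multiplications, so one factor in every remaining product stays relatively sparse:

def sparse_power_chain(a, n, presquare=1):
    # Compute a**n for a sparse matrix by squaring `presquare` times,
    # then multiplying by the squared factor in a chain.
    # Requires n to be a multiple of 2**presquare.
    b = a
    for _ in range(presquare):
        b = b * b                   # b = a**(2**presquare)
    result = b
    for _ in range(n // (2 ** presquare) - 1):
        result = result * b         # keep one relatively sparse factor
    return result

# a16 = sparse_power_chain(a, 16, presquare=1)   # equivalent to a2*a2*...*a2

The best value of `presquare` depends on how quickly the density grows, so treat it as a tuning knob rather than a fixed choice.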

Or, since you know the final result will be dense, switch to standard numpy arrays, either from the beginning or at some intermediate step at which dense matrix multiplication is more efficient than sparse matrix multiplication.
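
For instance, a sketch of that hybrid approach (switching to dense after two squarings is an assumption; the best crossover point depends on your machine and how fast the density grows):

import scipy.sparse

a = scipy.sparse.rand(2049, 2049, 0.002, format='csr')

# Square while the factors are still sparse...
a2 = a * a
a4 = a2 * a2

# ...then convert to a dense ndarray once the density is high
# and finish with dense matrix products.
d4 = a4.toarray()
d8 = d4.dot(d4)
a16 = d8.dot(d8)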

Warren Weckesser