
I was intrigued by the discussion at http://scipy.github.io/old-wiki/pages/PerformanceTips on how to get faster dot computations.

It concludes that dotting C-contiguous matrices should be faster, and presents the following results:

import numpy as np
from time import time
N = 1000000
n = 40
A = np.ones((N,n))

AT_F = np.ones((n,N), order='F')
AT_C = np.ones((n,N), order='C')
>>> t = time();C = np.dot(A.T, A);t1 = time() - t
3.9203271865844727
>>> t = time();C = np.dot(AT_F, A);t2 = time() - t
3.9461679458618164
>>> t = time();C = np.dot(AT_C, A);t3 = time() - t
2.4167969226837158

I tried it as well (Python 3.7), and the final computation, using C-contiguous matrices, is not faster at all!

I get the following results:

 >>> t1
 0.2102820873260498
 >>> t2
 0.4134488105773926
 >>> t3
 0.28309035301208496

It turns out that the first approach is the fastest.

Where does this discrepancy between their measurements and mine come from? And how can transposing in the first case not slow the calculation down?

Thanks

user37292
  • Are you on Windows by any chance? – roganjosh Nov 27 '19 at 22:23
  • macbook pro, I get the fastest to be t3 = 0.09004902839660645 – seralouk Nov 27 '19 at 22:26
  • I get `A` to be fastest. But this is a silly comparison: 1) it's a single pass and the variation is all over the place, 2) OSs like Windows throttle CPU power to save energy and gradually release resources when something is CPU-intensive. Use `timeit`, otherwise it's a pointless debate – roganjosh Nov 27 '19 at 22:33
  • I get the same pattern (as the OP) with linux and `timeit`. `t1` is noticeably faster; `t2` is slightly slower than `t3`. I think there's something about the modern `dot/@` that can detect the common base of `A.T` and `A`, and take an optimized `BLAS` route. – hpaulj Nov 27 '19 at 23:04
  • @hpaulj I'll have to look into MKL speeds tomorrow, then. – roganjosh Nov 27 '19 at 23:07
  • You edited @hpaulj , and I can't remember what statement I replied to :/ – roganjosh Nov 27 '19 at 23:11
  • [Numpy dot too clever about symmetric multiplications](https://stackoverflow.com/questions/43453707/numpy-dot-too-clever-about-symmetric-multiplications); [Numpy efficient matrix self-multiplication (gram matrix)](https://stackoverflow.com/questions/50733148/numpy-efficient-matrix-self-multiplication-gram-matrix) – hpaulj Nov 27 '19 at 23:13
  • @roganjosh, I am on Windows, yes. I take the point that this could make the comparison meaningless... but I will be looking forward to hearing more about MKL speeds and these modern `dot`/`@` features. Thanks – user37292 Nov 27 '19 at 23:24
  • @user37292 first, [this](https://stackoverflow.com/a/38562303/4799172) lesson for me. Second, you need to use `np.show_config()` to see if there are any MKL libraries in there – roganjosh Nov 27 '19 at 23:28
  • Beyond that, if you don't want anaconda, you can use one of the [unofficial binaries](https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) – roganjosh Nov 27 '19 at 23:32
  • @roganjosh, thanks, really interesting. – user37292 Nov 29 '19 at 09:25
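
Following the `timeit` and `np.show_config()` suggestions in the comments above, a minimal benchmark sketch might look like this (timings are machine- and BLAS-dependent, so the exact numbers will vary):

import numpy as np
from timeit import timeit

N, n = 1000000, 40
A = np.ones((N, n))
AT_F = np.ones((n, N), order='F')
AT_C = np.ones((n, N), order='C')

# report which BLAS/LAPACK libraries (e.g. MKL, OpenBLAS) NumPy is linked against
np.show_config()

# average over several runs instead of timing a single pass
for name, expr in [('A.T @ A',  lambda: A.T @ A),
                   ('AT_F @ A', lambda: AT_F @ A),
                   ('AT_C @ A', lambda: AT_C @ A)]:
    print(name, timeit(expr, number=5) / 5, 'seconds per call')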

1 Answer


My linux/timeit times:

In [122]: timeit A.T@A                                                          
258 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [123]: timeit AT_F@A                                                         
402 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [124]: timeit AT_C@A                                                         
392 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [125]: %%timeit x=A.T.copy(order='F') 
     ...: x@A                                                                       
410 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
hpaulj
  • Most interesting, so it has to do with the use of BLAS `syrk`, making what I read on the quoted link rather obsolete. – user37292 Nov 27 '19 at 23:27
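
As noted in the comments (and in the linked "Numpy dot too clever about symmetric multiplications" question), the `A.T @ A` case can be faster because `A.T` is only a view of `A`, so NumPy can detect the shared buffer and dispatch to the symmetric rank-k BLAS routine (`syrk`); a separate array like `AT_C` never qualifies. A small check, using the same arrays as in the question:

import numpy as np

N, n = 1000000, 40
A = np.ones((N, n))
AT_C = np.ones((n, N), order='C')

print(A.T.base is A)              # True: the transpose is just a view of A
print(np.shares_memory(A, A.T))   # True: same underlying buffer
print(np.shares_memory(A, AT_C))  # False: independent data, so no syrk shortcut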