
I was intrigued by the discussion at http://scipy.github.io/old-wiki/pages/PerformanceTips on how to get faster dot computations.

It concludes that dotting C-contiguous matrices should be faster, and presents the following results:

import numpy as np
from time import time
N = 1000000
n = 40
A = np.ones((N,n))

AT_F = np.ones((n,N), order='F')
AT_C = np.ones((n,N), order='C')
>>> t = time();C = np.dot(A.T, A);t1 = time() - t
3.9203271865844727
>>> t = time();C = np.dot(AT_F, A);t2 = time() - t
3.9461679458618164
>>> t = time();C = np.dot(AT_C, A);t3 = time() - t
2.4167969226837158

I tried it as well (Python 3.7), and the final computation, using C-contiguous matrices, is not faster at all!

I get the following results:

 >>> t1
 0.2102820873260498
 >>> t2
 0.4134488105773926
 >>> t3
 0.28309035301208496

It turns out that the first approach is the fastest.

Where does this discrepancy between their measurements and mine come from? And how can transposing in the first case not slow the calculation down?

Thanks

user37292
  • Are you on Windows by any chance? – roganjosh Nov 27 '19 at 22:23
  • macbook pro, I get the fastest to be t3 = 0.09004902839660645 – seralouk Nov 27 '19 at 22:26
  • I get `A` to be fastest. But this is a silly comparison: 1) it's a single pass and the variation is all over the place, 2) OSs like Windows throttle CPU power to save energy and gradually release resources when something is CPU-intensive. Use `timeit`, otherwise it's a pointless debate – roganjosh Nov 27 '19 at 22:33
  • I get the same pattern (as the OP) with linux and `timeit`. `t1` is noticeably faster; `t2` is slightly slower than `t3`. I think there's something about the modern `dot/@` that can detect the common base of `A.T` and `A`, and take an optimized `BLAS` route. – hpaulj Nov 27 '19 at 23:04
  • @hpaulj I'll have to look into MKL speeds tomorrow, then. – roganjosh Nov 27 '19 at 23:07
  • You edited @hpaulj , and I can't remember what statement I replied to :/ – roganjosh Nov 27 '19 at 23:11
  • [Numpy dot too clever about symmetric multiplications](https://stackoverflow.com/questions/43453707/numpy-dot-too-clever-about-symmetric-multiplications); [Numpy efficient matrix self-multiplication (gram matrix)](https://stackoverflow.com/questions/50733148/numpy-efficient-matrix-self-multiplication-gram-matrix) – hpaulj Nov 27 '19 at 23:13
  • @roganjosh, I am on Windows, yes. I take the point that this could make the comparison meaningless... but I will be looking forward to hearing more about MKL speeds and these modern `dot`/`@` features. Thanks – user37292 Nov 27 '19 at 23:24
  • @user37292 first, [this](https://stackoverflow.com/a/38562303/4799172) lesson for me. Second, you need to use `np.show_config()` to see if there are any MKL libraries in there – roganjosh Nov 27 '19 at 23:28
  • Beyond that, if you don't want anaconda, you can use one of the [unofficial binaries](https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy) – roganjosh Nov 27 '19 at 23:32
  • @roganjosh, thanks, really interesting. – user37292 Nov 29 '19 at 09:25
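
Following the `timeit` and `np.show_config()` suggestions in the comments above, a minimal benchmark sketch might look like this (timings are machine- and BLAS-dependent, so the exact numbers will vary):

import numpy as np
from timeit import timeit

N, n = 1000000, 40
A = np.ones((N, n))
AT_F = np.ones((n, N), order='F')
AT_C = np.ones((n, N), order='C')

# report which BLAS/LAPACK libraries (e.g. MKL, OpenBLAS) NumPy is linked against
np.show_config()

# average over several runs instead of timing a single pass
for name, expr in [('A.T @ A',  lambda: A.T @ A),
                   ('AT_F @ A', lambda: AT_F @ A),
                   ('AT_C @ A', lambda: AT_C @ A)]:
    print(name, timeit(expr, number=5) / 5, 'seconds per call')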

1 Answer


My linux/timeit times:

In [122]: timeit A.T@A                                                          
258 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [123]: timeit AT_F@A                                                         
402 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [124]: timeit AT_C@A                                                         
392 ms ± 9.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [125]: %%timeit x=A.T.copy(order='F') 
     ...: x@A                                                                       
410 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
hpaulj
  • Most interesting, so it has to do with the use of BLAS `syrk`, making what I read on the quoted link rather obsolete. – user37292 Nov 27 '19 at 23:27
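
As noted in the comments (and in the linked "Numpy dot too clever about symmetric multiplications" question), the `A.T @ A` case can be faster because `A.T` is only a view of `A`, so NumPy can detect the shared buffer and dispatch to the symmetric rank-k BLAS routine (`syrk`); a separate array like `AT_C` never qualifies. A small check, using the same arrays as in the question:

import numpy as np

N, n = 1000000, 40
A = np.ones((N, n))
AT_C = np.ones((n, N), order='C')

print(A.T.base is A)              # True: the transpose is just a view of A
print(np.shares_memory(A, A.T))   # True: same underlying buffer
print(np.shares_memory(A, AT_C))  # False: independent data, so no syrk shortcut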