a.real and a.imag are views into the complex array (no data is copied), while np.vstack creates a new copy. The way the @ operator (matmul()) deals with the non-contiguous views a.real and a.imag takes longer. To make it faster, you can either create a copy of each view and pass those to @, or use np.dot(a.real.T, a.real) and np.dot(a.imag.T, a.imag) (I am not sure of the BLAS implementation behind each).
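As a quick check (my own snippet, not part of the benchmark below; np.shares_memory is the standard NumPy helper), you can confirm that a.real is a view while np.vstack returns a fresh array:
import numpy as np
a = np.random.rand(4, 3) + np.random.rand(4, 3)*1j
print(np.shares_memory(a, a.real))                      # True: a.real is a view into a
print(np.shares_memory(a, np.vstack((a.real, a.imag)))) # False: vstack allocates new memory
print(a.real.strides, a.real.itemsize)                  # strides (48, 16) vs itemsize 8: the view is not contiguous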
For large matrices, the first method in the following code should still be slightly faster:
import numpy as np
import time

a = np.random.rand(20000, 100) + np.random.rand(20000, 100)*1j

# Method 1: stack the real and imaginary parts into one contiguous array
tic = time.time()
b = np.vstack((a.real, a.imag))
c1 = b.T @ b
t1 = time.time() - tic

# Method 2: copy each view first, then multiply and add
tic = time.time()
b = a.real.copy()
c = a.imag.copy()
c2 = b.T @ b + c.T @ c
t2 = time.time() - tic

print('t1=%f. t2=%f.' % (t1, t2))
Output:
t1=0.031620. t2=0.021769.
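For completeness, here is a sketch of timing the np.dot variant mentioned above (it reuses the same a; I have not checked which BLAS path it takes):
tic = time.time()
c3 = np.dot(a.real.T, a.real) + np.dot(a.imag.T, a.imag)
t3 = time.time() - tic
print('t3=%f.' % t3)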
EDIT: Dive deeper:
Let's take a quick look at various memory layouts:
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
print('a.flags\n', a.flags)
print('a.real flags\n', a.real.flags)
a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
...
a.real flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
So a is C_CONTIGUOUS and a.real is not. I am not sure exactly how @ implements the calculation, but my guess is that the contiguous and strided code paths differ in how they can use cache blocking, strides, and loop unrolling; I will leave the details to the experts. Now, ndarray.copy() returns a C_CONTIGUOUS array by default (BE CAREFUL: np.copy() does not guarantee this, since it defaults to order='K' and tries to preserve the source layout), and that is why the second approach above is about as fast as the first one (where b is also C_CONTIGUOUS).
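A minimal illustration of that copy-order caveat, using a transposed real array as my own example rather than a.real:
x = np.random.rand(1000, 200)
xt = x.T                                  # a view with F-ordered strides, not C_CONTIGUOUS
print(xt.copy().flags['C_CONTIGUOUS'])    # True: ndarray.copy() defaults to order='C'
print(np.copy(xt).flags['C_CONTIGUOUS'])  # False: np.copy() defaults to order='K' and keeps the F-like layout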
@George C: I think the first one is slightly faster because np.vstack creates a single new C_CONTIGUOUS array, so the whole product can exploit cache locality in one place, while in the second approach the results of a.real.T @ a.real and a.imag.T @ a.imag live at different memory locations and require an extra addition pass. Here is a link to more explanation.
Disclaimer: Any expert who knows the details of the NumPy implementation is welcome to edit this post.