a.real and a.imag are views into the complex array (no data is copied), while np.vstack creates a new copy. The way the @ operator (matmul()) deals with the non-contiguous views a.real and a.imag takes longer. To make it faster, you can either create a copy of each view and pass those to @, or use np.dot(a.real.T, a.real) and np.dot(a.imag.T, a.imag) (I am not sure of the BLAS implementation behind each).
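As a quick check (my own snippet, not part of the benchmark below; np.shares_memory is the standard NumPy helper), you can confirm that a.real is a view while np.vstack returns a fresh array:
import numpy as np
a = np.random.rand(4, 3) + np.random.rand(4, 3)*1j
print(np.shares_memory(a, a.real))                      # True: a.real is a view into a
print(np.shares_memory(a, np.vstack((a.real, a.imag)))) # False: vstack allocates new memory
print(a.real.strides, a.real.itemsize)                  # strides (48, 16) vs itemsize 8: the view is not contiguous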
For large matrices, the first method in the following code should still be slightly faster:
import numpy as np
import time

a = np.random.rand(20000, 100) + np.random.rand(20000, 100)*1j

# Method 1: stack the real and imaginary parts into one contiguous array
tic = time.time()
b = np.vstack((a.real, a.imag))
c1 = b.T @ b
t1 = time.time() - tic

# Method 2: copy each view first, then multiply and add
tic = time.time()
b = a.real.copy()
c = a.imag.copy()
c2 = b.T @ b + c.T @ c
t2 = time.time() - tic

print('t1=%f. t2=%f.' % (t1, t2))
Output:
t1=0.031620. t2=0.021769.
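For completeness, here is a sketch of timing the np.dot variant mentioned above (it reuses the same a; I have not checked which BLAS path it takes):
tic = time.time()
c3 = np.dot(a.real.T, a.real) + np.dot(a.imag.T, a.imag)
t3 = time.time() - tic
print('t3=%f.' % t3)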
EDIT: Dive deeper:
Let's take a quick look at various memory layouts:
a = np.random.rand(20000,100)+np.random.rand(20000,100)*1j
print('a.flags\n', a.flags)
print('a.real flags\n', a.real.flags)
a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
...
a.real flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
So a is C_CONTIGUOUS and a.real is not. I am not sure exactly how @ implements the calculation, but my guess is that the contiguous and strided code paths differ in how they can use cache blocking, strides, and loop unrolling; I will leave the details to the experts. Now, ndarray.copy() returns a C_CONTIGUOUS array by default (BE CAREFUL: np.copy() does not guarantee this, since it defaults to order='K' and tries to preserve the source layout), and that is why the second approach above is about as fast as the first one (where b is also C_CONTIGUOUS).
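A minimal illustration of that copy-order caveat, using a transposed real array as my own example rather than a.real:
x = np.random.rand(1000, 200)
xt = x.T                                  # a view with F-ordered strides, not C_CONTIGUOUS
print(xt.copy().flags['C_CONTIGUOUS'])    # True: ndarray.copy() defaults to order='C'
print(np.copy(xt).flags['C_CONTIGUOUS'])  # False: np.copy() defaults to order='K' and keeps the F-like layout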
@George C: I think the first one is slightly faster because np.vstack creates a single new C_CONTIGUOUS array, so the whole product can exploit cache locality in one place, while in the second approach the results of a.real.T @ a.real and a.imag.T @ a.imag live at different memory locations and require an extra addition pass. Here is a link to more explanation.
Disclaimer: Any expert who knows the details of the NumPy implementation is welcome to edit this post.