I am trying to find a fast way to calculate the dot product of two complex64 matrices in Python.
import numpy as np

B = np.complex64(np.random.rand(3014, 4) + 1j*np.random.rand(3014, 4))
A = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))
Numpy achieves the best result:
%timeit np.dot(A, B)
10000 loops, best of 3: 165 µs per loop
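As far as I can tell the result keeps the complex64 dtype, so my assumption is that the single-precision complex BLAS routine (cgemm) is the one doing the work:
C = np.dot(A, B)
print(C.shape, C.dtype)   # (32, 4) complex64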
Scipy is the same:
%timeit scipy.dot(A, B)
10000 loops, best of 3: 165 µs per loop
Numpy's einsum is much slower:
%timeit np.einsum('ij,jk->ik', A, B)
1000 loops, best of 3: 1.2 ms per loop
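My understanding (which may be wrong) is that einsum does not dispatch to BLAS, which would explain the gap. A quick check that it at least computes the same result:
# should print True: einsum with these subscripts is the same as np.dot
print(np.allclose(np.einsum('ij,jk->ik', A, B), np.dot(A, B)))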
Writing the dot product out as explicit for loops (following Comparing Python, Numpy, Numba and C++ for matrix multiplication) is not much better, using my naive Numba implementation with just the @jit() decorator:
from numba import jit

@jit()
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p), dtype=complex)  # note: this is complex128, not complex64
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C
The slowest run took 153.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.02 ms per loop
In general, Numba has never improved the speed of any Numpy-related function I have used (I might be doing something wrong?).
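For reference, this is the kind of variant I would try next, assuming nopython mode compiles for complex64 and that keeping the accumulator in complex64 matters (both are my assumptions, not tested here):
from numba import jit

@jit(nopython=True)
def dot_py64(A, B):
    m, n = A.shape
    p = B.shape[1]
    # keep the accumulator in complex64 so nothing is upcast to complex128
    C = np.zeros((m, p), dtype=np.complex64)
    for i in range(m):
        for k in range(n):
            a = A[i, k]
            # innermost loop over j walks B row-wise (row-major friendly)
            for j in range(p):
                C[i, j] += a * B[k, j]
    return C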
Only with Theano do I come close:
import theano
import theano.tensor as T

x = T.cmatrix('x')
y = T.cmatrix('y')
z = T.dot(x, y)
resultdot = theano.function([x, y], z)
%timeit resultdot(A, B)
10000 loops, best of 3: 182 µs per loop
numexpr does not apply here, does it?
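(My understanding is that numexpr only accelerates elementwise expressions, so there is no way to express a matrix product in it; a small sketch of what it does cover, just for completeness:)
import numexpr as ne

a = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))
b = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))
# elementwise complex arithmetic like this is what numexpr handles
# (internally as complex128); there is no matrix-product operation
c = ne.evaluate("a * b + a")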
Using a function that applies np.dot to blocks that are explicitly read into core memory (from Efficient dot products of large memory-mapped arrays) does not result in a great speed-up either:
def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while True:
        yield slice(count, count + block_size, 1)
        count += block_size
        if count > dim_size:
            return  # end the generator (raising StopIteration is deprecated)

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion.
    Only blocks of `A` with a maximum size of `max_elements` will be
    processed simultaneously.
    """
    m, n = A.shape
    n1, o = B.shape
    if n1 != n:
        raise ValueError('matrices are not aligned')
    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols
    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows
    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')
    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block
    return out
%timeit blockwise_dot(A, B, max_elements=int(2**27), out=None)
1000 loops, best of 3: 252 µs per loop
In short: is there no faster way to calculate the dot product with any Python library? It does not even use multiple cores; htop shows that only one core is active during the Numpy calculation. Shouldn't this be different when Numpy is linked against BLAS etc.?
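For completeness, this is how I check which BLAS my Numpy build is linked against, and my assumption is that for OpenBLAS/MKL builds the thread count is controlled by environment variables set before Python starts:
import numpy as np

np.__config__.show()   # prints the BLAS/LAPACK libraries Numpy was built against

# typical thread-count controls for OpenBLAS/MKL, set in the shell beforehand:
#   export OMP_NUM_THREADS=4
#   export MKL_NUM_THREADS=4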