
I am trying to find a fast way to compute the dot product of two complex64 matrices in Python.

import numpy as np

B = np.complex64(np.random.rand(3014, 4) + 1j*np.random.rand(3014, 4))
A = np.complex64(np.random.rand(32, 3014) + 1j*np.random.rand(32, 3014))
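
(Quick sanity check that both operands and the product really are single-precision complex, so np.dot should dispatch to the BLAS cgemm routine:)

print(A.dtype, B.dtype)      # complex64 complex64
print(np.dot(A, B).dtype)    # complex64
print(np.dot(A, B).shape)    # (32, 4)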

Numpy achieves the best result:

%timeit np.dot(A, B)
10000 loops, best of 3: 165 µs per loop

Scipy is the same:

%timeit scipy.dot(A, B)
10000 loops, best of 3: 165 µs per loop

numpy.einsum is much slower:

%timeit np.einsum('ij,jk->ik', A, B)
1000 loops, best of 3: 1.2 ms per loop

Writing the dot product out with explicit for loops (from Comparing Python, Numpy, Numba and C++ for matrix multiplication) is not much better, using my naive Numba implementation with just the `@jit()` decorator:

from numba import jit
import numpy as np

@jit()
def dot_py(A, B):
    # Naive triple loop; note the accumulator is complex128 (dtype=complex),
    # not complex64 like the inputs.
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m, p), dtype=complex)

    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C


The slowest run took 153.35 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.02 ms per loop

In general, Numba has never improved the speed of any NumPy-related function I have used (I might be doing something wrong?).
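
For what it's worth, a nopython-mode sketch of the same loop is below. The output buffer is allocated outside the jitted function so Numba never has to type `np.zeros(..., dtype=complex)`; `dot_py_nopython` is just an illustrative name, and whether this compiles cleanly for complex64 may depend on the Numba version:

from numba import jit
import numpy as np

@jit(nopython=True)
def dot_py_nopython(A, B, C):
    # C must be a zeroed output array allocated by the caller; keeping the
    # allocation outside the jitted code sidesteps typing np.zeros(..., dtype=complex).
    m, n = A.shape
    p = B.shape[1]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

C = np.zeros((A.shape[0], B.shape[1]), dtype=np.complex64)
dot_py_nopython(A, B, C)  # first call includes compilation time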

Only with Theano do I come close:

import theano
import theano.tensor as T

x = T.cmatrix('x')
y = T.cmatrix('y')
z = theano.tensor.dot(x, y)
resultdot = theano.function([x, y], z)
%timeit resultdot(A, B)

10000 loops, best of 3: 182 µs per loop

numexpr does not apply here, does it?
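
(As far as I can tell, numexpr only accelerates elementwise expressions and reductions, so there is nothing equivalent to np.dot; a minimal example of what it does cover:)

import numexpr as ne
import numpy as np

x = np.random.rand(32, 3014).astype(np.float32)
y = ne.evaluate('2 * x + 1')  # elementwise expressions are what numexpr handles;
                              # there is no matrix-product operator in its syntax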

Using a function that applies np.dot to blocks that are explicitly read into core memory from the memory-mapped array (from Efficient dot products of large memory-mapped arrays) does not result in a great speed-up:

def _block_slices(dim_size, block_size):
    """Generator that yields slice objects for indexing into
    sequential blocks of an array along a particular axis
    """
    count = 0
    while True:
        yield slice(count, count + block_size, 1)
        count += block_size
        if count > dim_size:
            return  # stop after the final (possibly partial) block

def blockwise_dot(A, B, max_elements=int(2**27), out=None):
    """
    Computes the dot product of two matrices in a block-wise fashion. 
    Only blocks of `A` with a maximum size of `max_elements` will be 
    processed simultaneously.
    """

    m,  n = A.shape
    n1, o = B.shape

    if n1 != n:
        raise ValueError('matrices are not aligned')

    if A.flags.f_contiguous:
        # prioritize processing as many columns of A as possible
        max_cols = max(1, max_elements // m)
        max_rows = max_elements // max_cols

    else:
        # prioritize processing as many rows of A as possible
        max_rows = max(1, max_elements // n)
        max_cols = max_elements // max_rows

    if out is None:
        out = np.empty((m, o), dtype=np.result_type(A, B))
    elif out.shape != (m, o):
        raise ValueError('output array has incorrect dimensions')

    for mm in _block_slices(m, max_rows):
        out[mm, :] = 0
        for nn in _block_slices(n, max_cols):
            A_block = A[mm, nn].copy()  # copy to force a read
            out[mm, :] += np.dot(A_block, B[nn, :])
            del A_block

    return out




%timeit blockwise_dot(A, B, max_elements=int(2**27), out=None)
1000 loops, best of 3: 252 µs per loop

In short: is there no faster way to compute this dot product with any Python library? It does not even use multiple cores: htop shows that only one core is active during the NumPy calculation. Shouldn't this be different when NumPy is linked against BLAS etc.?
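
(For reference, this is how I check what NumPy is linked against and how many threads MKL reports; the mkl module is only available with Anaconda's MKL packages:)

import numpy as np
np.show_config()              # shows the BLAS/LAPACK build configuration (MKL here)

import mkl                    # ships with Anaconda's MKL builds
print(mkl.get_max_threads())  # reports 4 on this machine
mkl.set_num_threads(4)        # htop still shows only one active core for np.dot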

  • Try using `@jit(nopython=True)`. If it compiles and works properly it should be pretty fast but probably slower than `np.dot`. – pbreach May 10 '16 at 12:02
  • My answer in [Efficient dot products of large memory-mapped arrays](http://stackoverflow.com/questions/20983882/efficient-dot-products-of-large-memory-mapped-arrays) would only be relevant if your arrays are memory-mapped. – ali_m May 10 '16 at 12:07
  • I indeed tried this, but it seems to have problems with the complex numbers? `TypingError: Invalid usage of Function() with parameters ((int64 x 2), Function()) * parameterized` – BPresent May 10 '16 at 12:09
  • Is it this `blas_opt_info: define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']` or `blas_mkl_info: define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)] libraries = ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread']` ? – BPresent May 10 '16 at 12:11
  • Indeed, numpy was installed on Linux via Anaconda including mkl. I am just surprised that still only one core is used. Could it be that mkl decides how many threads are best (even if it is just one)? From: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations `import mkl mkl.get_max_threads() Out[316]: 4 mkl.set_num_threads(4)` It still uses only one core though. Yes, the laptop has a (weak) CUDA-enabled gpu. – BPresent May 10 '16 at 12:50
  • The Accelerate library by Continuum Analytics is installed with an academic licence. Is it therefore not possible to have faster dot products on a cpu? Even Theano (which I made sure uses the gpu instead of the cpu) is slower, but my gpu is not the fastest, as mentioned. Would this mean that even gpus will not provide more speed gains? In a profiled function the large number of dot products is the bottleneck. – BPresent May 10 '16 at 13:14
  • Just as a comparison: I tried the same in Julia without any optimizations - Numpy is still faster: `julia> @timeit *(A, B) 1000 loops, best of 3: 259.86 µs per loop 0.000259861893 ` – BPresent May 10 '16 at 13:32
  • The output is `theano.config.device Out: 'gpu'`, The .theanorc file was changed so it uses the gpu. I am not sure if I already hit the hardware limit or if some software optimizations are possible. Many examples on the web can be found where people claim that their method is much faster than Numpy, however, in this case it seems not to be reproducible? – BPresent May 10 '16 at 14:36
  • You'll have to get more specific about these claims. As a rule of thumb, I'd say that an optimized multithreaded BLAS library such as OpenBLAS or MKL would be very hard to beat for a single complex dot product on the CPU. Did you install the MKL Optimizations package? – ali_m May 10 '16 at 15:40
  • The things I find on the web don't refer to the dot product but to other functions, i.e. this is an apples vs. oranges comparison, this is true. `sudo conda install mkl # All requested packages already installed.` The np.show_config() output suggests that it is installed, is this correct? Can I check which of the BLAS libraries is actually used? One option is to try to get access to a more powerful gpu to test the Theano performance, which is already almost as good as Numpy's on my GeForce GT 750M mobile chip. My takeaway would be that on this cpu, Numpy + MKL acceleration is hard to beat. – BPresent May 10 '16 at 16:30
