
I am trying to get fast matrix computations with Anaconda Accelerate. I started with a very basic example: multiplying two matrices.

My goal is to get GPU matrix multiplication that is faster than the usual numpy.dot.

Here is my basic example, based on this documentation.

from numbapro import guvectorize

# Naive triple-loop multiply compiled as a GPU gufunc;
# the last argument C is the output array, filled in place.
@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

import numpy as np
import time

for dim in [50, 100, 200]:
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)

    start = time.time()
    rescpu = np.dot(a, b)
    print('CPU:', time.time() - start)

    # The gufunc allocates and returns the output array itself
    start = time.time()
    resgpu = matmul(a, b)
    print('GPU:', time.time() - start)

    print(np.allclose(rescpu, resgpu))
    print(np.allclose(resgpu, rescpu))

The results are disappointing: the GPU is dramatically slower than the CPU:

CPU: 0.00011801719665527344
GPU: 0.05677294731140137
True
True
CPU: 0.00011205673217773438
GPU: 0.3881375789642334
True
True
CPU: 0.00038933753967285156
GPU: 3.018171787261963
True
True

Of course I understand that NumPy's internal implementation is well optimized, but I expected Anaconda's official example to perform well. I am using Python 3.4.3 and got errors when trying these two helper libraries: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy

I should note that gpupy did give me a successful speedup on Python 2.7.

So my question is: how can I get matrix multiplication on the GPU that beats NumPy on the CPU? What is wrong with the official Anaconda example, and is there a working library for Python 3 that lets me use the GPU in a NumPy-like way?

===

RESULTS

Unfortunately, there is no simple and good way to do this on Python 3; use 2.7 instead.

Thanks to @rth for recommending the awesome library scikits.cuda.

Available functions: http://scikit-cuda.readthedocs.org/en/latest/reference.html#reference

A benchmark (tested with Anaconda's MKL build, so numpy is fast too):

import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg
culinalg.init()

dim = 10000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)
b = rnd.rand(dim, dim).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

start = time.time()
rescpu = np.dot(a, b)
print 'CPU:', time.time() - start

start = time.time()
resgpu = culinalg.dot(a_gpu, b_gpu)
print 'GPU:', time.time() - start

resgpu = resgpu.get()
print np.allclose(rescpu, resgpu)
print np.allclose(resgpu, rescpu)

And the results:

CPU: 16.4765479565
GPU: 0.000520944595337
budmitr
  • Note that copying the data to and from the GPU is costly, so `gpuarray.to_gpu` should be inside the timed section; I suppose the GPU time will then be a bit higher. – rth Jun 15 '15 at 19:48
  • Looking at your timings, I suspect the culinalg.dot is running asynchronously, i.e. you need to synchronize(), otherwise the GPU is still calculating during your print 'GPU:' statement. With your sizes, I get a ~5 times improvement on the GPU (Tesla K40, runs in 600 ms WITHOUT data transfers) compared to CPU (MKL, 28 cores). As opposed to your setup, I'm using numbapro.cudalib.Blas and do numba.cuda.synchronize(). – Fermion Portal Jan 06 '16 at 18:55
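
Following these two comments, a fairer GPU timing would include the host-to-device copies and force synchronization by fetching the result. A minimal sketch, reusing `a`, `b`, `gpuarray` and `culinalg` from the benchmark above:

start = time.time()
a_gpu = gpuarray.to_gpu(a)                 # count the transfer cost too
b_gpu = gpuarray.to_gpu(b)
resgpu = culinalg.dot(a_gpu, b_gpu).get()  # .get() blocks until the GPU finishes
print 'GPU incl. transfers:', time.time() - start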

1 Answer


You should have a look at BLAS implementations that provide highly optimized routines for classical linear algebra operations. The multiplication of dense matrices is performed with the gemm function.

  • For instance, matrix multiplication in numpy is significantly faster if it is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).
  • For the GPU, NVIDIA provides the cuBLAS implementation. According to this answer, it can be called with numpy arrays using the scikits.cuda module. Anaconda Accelerate, which you are using, also provides direct bindings to cuBLAS (a rough sketch follows this list).
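
A rough sketch of the direct-binding route, assuming (per the comments below) that `numbapro.cudalib.cublas.Blas.gemm` accepts host numpy arrays and copies them to the device automatically; cuBLAS works on column-major data, hence the Fortran-ordered arrays:

import numpy as np
from numbapro.cudalib import cublas

dim = 1000
rnd = np.random.RandomState(0)
# cuBLAS assumes column-major (Fortran) layout
a = np.asfortranarray(rnd.rand(dim, dim).astype(np.float32))
b = np.asfortranarray(rnd.rand(dim, dim).astype(np.float32))
c = np.zeros((dim, dim), dtype=np.float32, order='F')

blas = cublas.Blas()
# C = 1.0 * A @ B + 0.0 * C; 'N' means "no transpose"
blas.gemm('N', 'N', dim, dim, dim, 1.0, a, b, 0.0, c)
print(np.allclose(c, np.dot(a, b)))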

BTW, if you want to benchmark CPU vs. GPU performance for matrix multiplication, you should also specify which BLAS Numpy uses for the CPU calculations, since the results can differ by an order of magnitude (see this benchmark).
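
For reference, you can check which BLAS your NumPy build is linked against from Python itself:

import numpy as np
# Prints the BLAS/LAPACK libraries NumPy was compiled against
# (MKL for the Anaconda builds discussed above)
np.__config__.show()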

rth
  • well, there is no dgemm function in numbapro cublas (http://docs.continuum.io/numbapro/cudalib.html); there are some similar names, but they deal with diagonal matrices, and that is not what I need. I tried scikits.cuda, but again, like the previously noted libs (gpupy), it doesn't support Python 3 (http://scikit-cuda.readthedocs.org/en/latest/install.html#installation-dependencies). Seems like I need to use Python 2 anyway; Python 3 is still raw in 2015... – budmitr Jun 15 '15 at 15:42
  • @Apfel See `numbapro.cudalib.cublas.Blas.gemm` in your comment's link: that should work. The `d` in `dgemm` stands for `double`, so they probably dropped it. There is no reason why it wouldn't work with Python 3. Apart from a few exceptions, most serious modules support both 3.x and 2.7 at this point. – rth Jun 15 '15 at 15:52
  • well, after I tested scikits.cuda on Python 2.7 I am happy like crazy. It outperforms numpy by a great margin, with very cute syntax, like pure numpy. This library has active support, and moreover, it outperforms gpupy, which I tested before. The number of available math functions is also nice (http://scikit-cuda.readthedocs.org/en/latest/reference.html#reference). Thank you a lot! This is what I was looking for. The only thing is that Python 3 is unusable for now – budmitr Jun 15 '15 at 19:31
  • You are right, [install.rst](https://github.com/lebedov/scikits.cuda/blob/master/docs/source/install.rst) states "some parts of the package may not work with Python 3". Still Python 3 should be supported overall, and if you experience an issue with a particular function in Python3 in `scikits.cuda`, it might be worth opening a bug at [github](https://github.com/lebedov/scikits.cuda/issues), so somebody can fix it. – rth Jun 15 '15 at 19:40