cuda python GPU numbapro 3d loop poor performance

Question

I am trying to set up a 3D loop with the assignment

 C(i,j,k) = A(i,j,k) + B(i,j,k)

using Python on my GPU. This is my GPU:

http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications

The sources I'm looking at / comparing with are:

http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43

http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb

It's possible that I've imported more modules than necessary. This is my code:

import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@autojit
def myAdd(a, b):
  return a+b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
      return
    if j >= c.shape[1]:
      return
    if k >= c.shape[2]:
      return
    for i in xrange(0,c.shape[0]):
      for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):
          # c[i,j,k] = a[i,j,k] + b[i,j,k]
          c[i,j,k] = myAdd_gpu(a[i,j,k],b[i,j,k])

def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8,2: 32,3: 192,}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per mutliprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))

    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct

    a = np.ones((N,N,N), dtype = np.float32)
    b = np.ones((N,N,N), dtype = np.float32)
    c = np.zeros((N,N,N), dtype = np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a,b,cg)
    cg.to_host()
    dt = timer() - start
    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])


if __name__ == '__main__':
    main()

My result from running this is the following:

Running on GPU: GeForce GT 520
Compute capability:  2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per mutliprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2.  2.  2.]
c[-3:,:,:] = [ 2.  2.  2.]

When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.

I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? or griddim? Or maybe I'm using the wrong types for my GPU. I think I read that they must be a certain type. I'm very new to this so the problem very well could be trivial!

Any and all help is greatly appreciated!

score 4 · Accepted Answer · answered Jan 04 '15 at 14:10

You are creating your indexes correctly but then you're ignoring them. Running the nested loop

for i in xrange(0,c.shape[0]):
    for j in xrange(0,c.shape[1]):
        for k in xrange(0,c.shape[2]):

forces every thread to loop through all values in all three dimensions, which is not what you want. Each thread should compute only the element (or elements) that its own index maps to and then move on.
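To put a rough number on the redundant work, here is a back-of-the-envelope count using the values printed in the question (N = 100, 4 blocks of 32 threads, launched as a 1D grid). The variable names are only illustrative, and this is plain Python arithmetic, not GPU code.

```python
# Values taken from the question's code and output.
N = 100          # arrays are N x N x N
block_ct = 4     # blocks per grid (1D launch)
thread_ct = 32   # threads per block (1D launch)

launched = block_ct * thread_ct
# In a 1D launch j and k are always 0, so only the check on i can fail:
# threads whose global x index is >= N return before reaching the loops.
active = sum(1 for i in range(launched) if i < N)

elements = N ** 3                 # assignments actually needed
total_writes = active * elements  # each active thread runs the full triple loop
redundancy = total_writes // elements

print(launched)      # 128
print(active)        # 100
print(total_writes)  # 100000000
print(redundancy)    # 100
```

Under these assumptions every element of c is written about 100 times, once by each thread that passes the bounds checks.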

I think something like this should work better...

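# Start at this thread's global index in each dimension, then advance by the
# total number of threads in that dimension (blockDim * gridDim).
i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k], b[i,j,k])
            k += cuda.blockDim.z * cuda.gridDim.z
        j += cuda.blockDim.y * cuda.gridDim.y
    i += cuda.blockDim.x * cuda.gridDim.x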

Try to compile and run it. Also make sure to validate it, as I have not.
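One way to sanity-check the indexing without a GPU is to emulate the strided loops in plain Python and confirm that every element is visited exactly once. This is only a CPU-side sketch: the helper names (indices_for_thread, count_visits) and the small test shape are made up for illustration, and it says nothing about performance on real hardware.

```python
from itertools import product

def indices_for_thread(thread_idx, block_idx, block_dim, grid_dim, shape):
    """Yield the (i, j, k) elements one emulated thread would handle."""
    tx, ty, tz = thread_idx
    bx, by, bz = block_idx
    bw, bh, bd = block_dim
    gw, gh, gd = grid_dim
    i = tx + bx * bw
    while i < shape[0]:
        j = ty + by * bh
        while j < shape[1]:
            k = tz + bz * bd
            while k < shape[2]:
                yield (i, j, k)
                k += bd * gd   # stride = total threads along z
            j += bh * gh       # stride = total threads along y
        i += bw * gw           # stride = total threads along x

def count_visits(block_dim, grid_dim, shape):
    """Run every emulated thread and count how often each element is visited."""
    visits = {}
    for block_idx in product(*(range(g) for g in grid_dim)):
        for thread_idx in product(*(range(b) for b in block_dim)):
            for idx in indices_for_thread(thread_idx, block_idx,
                                          block_dim, grid_dim, shape):
                visits[idx] = visits.get(idx, 0) + 1
    return visits

shape = (10, 7, 5)  # kept small so pure Python stays fast

# 1D launch, like the question's [block_ct, thread_ct]: y and z sizes are 1.
visits_1d = count_visits(block_dim=(4, 1, 1), grid_dim=(2, 1, 1), shape=shape)
# 3D launch with fewer threads than elements.
visits_3d = count_visits(block_dim=(2, 2, 2), grid_dim=(2, 2, 1), shape=shape)

for visits in (visits_1d, visits_3d):
    assert len(visits) == shape[0] * shape[1] * shape[2]  # nothing skipped
    assert set(visits.values()) == {1}                    # nothing repeated
print("every element visited exactly once")
```

In this emulation the strided loops cover the whole array for both a 1D and a 3D launch configuration, which is why the same kernel body works even when there are fewer threads than elements.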

score -1 · Answer 2 · answered Jan 02 '15 at 19:57

I don't see you using imshow, or show, so there is no need to import those.

It doesn't appear as though you use your import of math (I didn't see any calls of math.some_function).

Your imports from numba and numbapro seem repetitive. Your "from numba import cuda" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda in numba, not the one in numbapro. When you call "from numba import *", you import everything from numba, not just cuda, which seems to be the only thing you use. Also, (I believe) "import numba.cuda" is equivalent to "from numba import cuda". Why not replace all your imports from numba and numbapro with a single "from numba import cuda"?
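The point about a later import overriding an earlier one of the same name can be seen with two standard-library modules. Here math and cmath are only stand-ins for numbapro and numba; they are not taken from the question.

```python
import cmath

from math import sqrt    # first binding of the name 'sqrt'
from cmath import sqrt   # rebinds 'sqrt'; the math version is no longer reachable by this name

uses_cmath = sqrt is cmath.sqrt
print(uses_cmath)   # True
print(sqrt(-1))     # 1j (math.sqrt would raise ValueError here)
```

Whichever import runs last determines what the bare name refers to, so keeping only one of the two imports makes the intent explicit.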


@Charlie - Sorry, I should have prefaced this with the remark that I was focusing on your comment about over importing modules. Perhaps, this really should have just been a comment, but it seemed a little long for one. – user1245262 Jan 02 '15 at 22:49
I realized that what I had just now was not working, so I put some of the imports back in. I know that it looks redundant, but there is nothing wrong with the imports. – Charles Jan 02 '15 at 23:29
