
I am new to PyCUDA and was going through some of the examples on the PyCUDA website. I am trying to figure out the logic behind certain lines of code and would really appreciate it if someone explained the idea behind them.

The code snippet below is from the PyCUDA website. Inside the kernel definition, I don't understand

int idx = threadIdx.x + threadIdx.y*4;

how the above line calculates the index into the array. Why are threadIdx.x and threadIdx.y added together, and why is threadIdx.y multiplied by 4?

Also, for the kernel call, why is the block defined as (5,5,1)? Since the array has 5x5 elements, in my understanding the block size should be (5,5) instead of (5,5,1).

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
     int idx = threadIdx.x + threadIdx.y*4;
     a[idx] *= 2;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print "ORIGINAL MATRIX"
print a
print "DOUBLED MATRIX AFTER PyCUDA EXECUTION"
print a_doubled
1 Answer


The example you have posted seems to have come from (or been plagiarised into) a book called "Python Parallel Programming Cookbook", which I hadn't heard of until five minutes ago. Honestly, if I were the authors of that book I would be ashamed to have included such a hacky, broken example.

Here is a small modification to what you posted, and its output:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
     int idx = threadIdx.x + threadIdx.y*4;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2.0*a

[warning: Python 2 syntax]

In [2]: %run matdouble.py
[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.         -0.62060976  0.49836278 -1.60820103  1.71903515]]

i.e. the code doesn't work as expected, and this is probably the source of your confusion: with a row stride of 4 instead of 5, some flat indices are computed by two different threads, while the last four elements of the array are never touched at all.
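To see concretely why a stride of 4 cannot address a 5-column array, here is a minimal NumPy sketch of the row-major index arithmetic (my illustration, not part of the original example):

import numpy

a = numpy.arange(25, dtype=numpy.float32).reshape(5, 5)
flat = a.ravel()  # the row-major (C order) linear storage the kernel sees

row, col = 2, 3
print(flat[col + row * 5])  # 13.0 -> a[2, 3], using the true row stride of 5
print(flat[col + row * 4])  # 11.0 -> a[2, 1], the wrong element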

The correct way to address a multidimensional array stored in linear memory (as numpy arrays are) is described in this very recent answer. Any sensible programmer would write the kernel in your example something like this:

__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}

so that the leading dimension of the array is passed as an argument to the kernel (which in this case should be 5, not 4). Doing this gives the following:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
lda = numpy.int32(a.shape[-1])
func(a_gpu, lda, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2.0*a

which produces the expected result:

In [3]: %run matdouble.py
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
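As an aside (my addition, not something from the original example), the explicit mem_alloc/memcpy boilerplate can be trimmed with pycuda.gpuarray, which wraps device allocation and transfers in a numpy-like container:

import pycuda.autoinit
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import numpy

a = numpy.random.randn(5, 5).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)  # allocates device memory and copies host -> device

mod = SourceModule("""
__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
lda = numpy.int32(a.shape[-1])
func(a_gpu.gpudata, lda, block=(5,5,1))

print(a_gpu.get() - 2.0 * a)  # .get() copies device -> host; prints all zeros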

Also, for the kernel call, why is the block defined as (5,5,1)? Since the array has 5x5 elements, in my understanding the block size should be (5,5) instead of (5,5,1).

In CUDA, all blocks implicitly have three dimensions, which is why PyCUDA's block argument takes three values. A block size of (5,5,1) means exactly what a 5x5 block would: the last dimension has extent one, so it can be ignored (every thread in the block has threadIdx.z = 0). What you shouldn't fall into the trap of doing is conflating the dimensions of the CUDA block or grid with the dimensions of the input array. It is sometimes convenient for them to be the same, but it is neither necessary nor even advisable. A correctly written kernel in the style of BLAS for this example (assuming row-major storage order) would probably look like this:

__global__ void doubleMatrix(float *a, int m, int n, int lda)
{
     // Starting (row, column) of this thread within the whole grid
     int col0 = threadIdx.x + blockIdx.x * blockDim.x;
     int row  = threadIdx.y + blockIdx.y * blockDim.y;

     // Grid-stride loops: advance by the total grid extent in each
     // dimension so any launch configuration covers the whole array
     for(; row < m; row += blockDim.y * gridDim.y) {
         for(int col = col0; col < n; col += blockDim.x * gridDim.x) {
             int idx = col + row * lda;
             a[idx] *= 2.f;
         }
     }
}

[Note: written in browser, not compiled or tested]

Here any legal block and grid dimensions will correctly process an input array of any size, as long as its total number of elements fits into a signed 32 bit integer. If you run too many threads, some will do nothing. If you run too few threads, some will process multiple array elements. If you run a grid with the same dimensions as the input array, each thread will process exactly one element, as was the intention in the example you were studying. If you want to read about how to choose the most appropriate block and grid sizes, I suggest starting here.
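For concreteness, here is a sketch of launching that BLAS-style kernel from PyCUDA with a deliberately undersized launch configuration, so that each thread strides over several elements. (This is my illustration, not part of the original example; the 7x7 array, 4x4 block and 1x1 grid are arbitrary choices.)

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy

mod = SourceModule("""
__global__ void doubleMatrix(float *a, int m, int n, int lda)
{
     int col0 = threadIdx.x + blockIdx.x * blockDim.x;
     int row  = threadIdx.y + blockIdx.y * blockDim.y;

     for(; row < m; row += blockDim.y * gridDim.y) {
         for(int col = col0; col < n; col += blockDim.x * gridDim.x) {
             a[col + row * lda] *= 2.f;
         }
     }
}
""")
func = mod.get_function("doubleMatrix")

a = numpy.random.randn(7, 7).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

m, n = numpy.int32(a.shape[0]), numpy.int32(a.shape[1])
lda = n  # for row-major storage the leading dimension is the column count

# Only 4x4 = 16 threads for 49 elements: the grid-stride loops make
# each thread process up to four elements
func(a_gpu, m, n, lda, block=(4,4,1), grid=(1,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled - 2.0 * a)  # should print all zeros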
