
Here I have tried to explain the CUDA launch parameters model (or execution configuration model) to myself using some pseudo code, but I don't know whether it contains any big mistakes. I hope someone can review it and give me some advice. Thanks in advance.

Here it is:

/*
  Normally, we write a kernel function like this.
  Note: __global__ means this function will be called from host code
  and executed on the device, and a __global__ function can only return void.
  As far as I know, any parameter passed to a __global__ function is stored
  in shared memory on the device. So a kernel function is quite different from a
  *normal* C/C++ function. If I were the CUDA author, I would make kernel functions
  look even more different from normal C functions.
*/

__global__ void
kernel(float *arr_on_device, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
                arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
        }
}

/*
  After this definition, we can call this kernel function from our normal C/C++ code!
  Doesn't that feel weird? Inconsistent?
  Normally, when I write C code, I think a lot about how execution proceeds down to
  the metal, and this... it feels like fragile code that breaks the sequential
  thinking process in my mind.
  To make things feel normal again, I found a way to explain it to myself: I expand the
  *__global__* function into some pseudo code:
*/

#define __foreach(var, start, end) for (var = start; var < end; ++var)

__device__ int
__indexing() {
        /* linear block index within the grid */
        const int blockId =
                blockIdx.x +
                blockIdx.y * gridDim.x +
                blockIdx.z * gridDim.x * gridDim.y;

        /* linear thread index within the whole grid */
        return
                blockId * (blockDim.x * blockDim.y * blockDim.z) +
                threadIdx.z * (blockDim.x * blockDim.y) +
                threadIdx.y * blockDim.x +
                threadIdx.x;
}

global_config =:
        {
                /*
                  Global configuration.
                  Note that the default values are all 1, so in the kernel code
                  we can simply ignore the unused dimensions.
                 */
                gridDim.x = gridDim.y = gridDim.z = 1;
                blockDim.x = blockDim.y = blockDim.z = 1;
        };

kernel =:
        {
                /*
                  I think CUDA does some bad evil-detail-covering things here.
                  It is said that CUDA C is an extension of C, but to me
                  CUDA C feels more like C++, and the *<<<>>>* part is too tricky.
                  For example:
                  kernel<<<10, 32>>>(); means kernel will execute in 10 blocks, each with 32 threads.

                  dim3 dimG(10, 1, 1);
                  dim3 dimB(32, 1, 1);
                  kernel<<<dimG, dimB>>>(); this is exactly the same thing as above.

                  So is this C style, or C++ style? At first I thought this could be done with
                  C++ constructors, but when I checked the *dim3* struct I could not see how its
                  constructors explain it. This just breaks the semantics of both C and C++. I think
                  forcing the user to write *kernel<<<dim3, dim3>>>* would be better, so I'd like to
                  keep that rule in my own future code.
                */

                gridDim  = dimG;
                blockDim = dimB;

                __foreach(blockIdx.z,  0, gridDim.z)
                __foreach(blockIdx.y,  0, gridDim.y)
                __foreach(blockIdx.x,  0, gridDim.x)
                __foreach(threadIdx.z, 0, blockDim.z)
                __foreach(threadIdx.y, 0, blockDim.y)
                __foreach(threadIdx.x, 0, blockDim.x)
                {
                        const int idx = __indexing();        
                        if (idx < n) {
                                arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
                        }
                }
        };

/*
  So, for me, gridDim & blockDim are like boundaries:
  e.g. gridDim.x is the (exclusive) upper bound of blockIdx.x. That was not obvious to someone like me.
 */
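/*
  To complete the picture, here is a minimal host-side sketch (my own illustration,
  with arbitrary sizes) showing how the kernel above is actually launched and how
  the configuration values end up in gridDim/blockDim:
*/

#include <cuda_runtime.h>

int main(void)
{
        const int n = 1000;
        float *arr_on_device;
        cudaMalloc(&arr_on_device, n * sizeof(float));

        /* a 1D configuration: enough 256-thread blocks to cover n elements */
        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;
        kernel<<<blocks, threads>>>(arr_on_device, n);

        cudaDeviceSynchronize();        /* wait for the kernel to finish */
        cudaFree(arr_on_device);
        return 0;
}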

/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
        unsigned int x, y, z;
#if defined(__cplusplus)
        __host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
        __host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
        __host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};

typedef __device_builtin__ struct dim3 dim3;
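/*
  Looking at this declaration again: dim3 does have constructors with default
  arguments, and that is what makes the scalar launch form well-formed C++.
  A small sketch of the conversion (arr_on_device and n are the kernel's
  arguments from above):
*/

dim3 a = 10;                    /* dim3(10, 1, 1): y and z default to 1 */
dim3 b(32);                     /* dim3(32, 1, 1) */

kernel<<<10, 32>>>(arr_on_device, n);   /* the ints are implicitly converted to dim3(10,1,1) and dim3(32,1,1) */
kernel<<<a, b>>>(arr_on_device, n);     /* exactly the same configuration */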
blackball
  • There are many classes available, including introductory ones that you can watch any time and only take 1 hour, [here](https://developer.nvidia.com/gpu-computing-webinars). Try the `GPU Computing using CUDA C` series. – Robert Crovella Oct 08 '13 at 06:35
  • Have you read what I explained below? – blackball Oct 08 '13 at 08:42
  • You seem to be quite confused on grids and blocks let alone the underlying architecture which is pretty damn important. I signed up for the free online CUDA course at udacity and got to effective (if not really advanced) coding in a week. Check it out because GPU programming seems to require solid foundations. – Boyko Perfanov Oct 08 '13 at 17:28
  • @Boyko I've checked the course you suggested, especially the part explaining the kernel launch configuration https://www.udacity.com/course/viewer#!/c-cs344/l-55120467/m-67074291 (it begins in that clip and the following 3 clips), but the instructor still doesn't make it clear how the GPU dispatches tasks to threads. Look at that kernel function: how does the compiler expand the code? So many people were just told to remember the rule, but how is the rule implemented in code? So now I'm reading the driver API and OpenCL documents, just to find out what happens under the hood. – blackball Oct 09 '13 at 07:40
  • Higher level languages are made so you DON'T have to worry about how code makes it into machine instructions on the CPU. And every new generation CPU brings something new, even though the instruction set is the same, what is ACTUALLY going on is not necessarily the same on different architectures. – Boyko Perfanov Oct 09 '13 at 14:39

3 Answers


CUDA DRIVER API

The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:

cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel

The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.

cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid

Additional information on these functions can be found in cuda.h.

CUresult CUDAAPI cuLaunchKernel(CUfunction f,
    unsigned int gridDimX,
    unsigned int gridDimY,
    unsigned int gridDimZ,
    unsigned int blockDimX,
    unsigned int blockDimY,
    unsigned int blockDimZ,
    unsigned int sharedMemBytes,
    CUstream hStream,
    void **kernelParams,
    void **extra);

cuLaunchKernel takes as parameters the entire launch configuration.

See NVIDIA Driver API [Execution Control] for more details.
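
As a rough illustration, launching the question's squaring kernel through the driver API could look like the following sketch (this assumes the kernel has been compiled into a module file named kernel.ptx; error checking is omitted):

#include <cuda.h>

void launch_square(CUdeviceptr d_arr, int n, CUstream stream)
{
    CUmodule module;
    CUfunction func;

    /* load the compiled module and look up the kernel by its mangled name */
    cuModuleLoad(&module, "kernel.ptx");
    cuModuleGetFunction(&func, module, "_Z6kernelPfi");

    /* kernelParams holds a pointer to each kernel argument */
    void *kernelParams[] = { &d_arr, &n };

    unsigned int blockDimX = 256;
    unsigned int gridDimX  = (n + blockDimX - 1) / blockDimX;

    /* grid (x,y,z), block (x,y,z), dynamic shared memory bytes, stream, params, extra */
    cuLaunchKernel(func, gridDimX, 1, 1, blockDimX, 1, 1, 0, stream, kernelParams, NULL);
}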

CUDA KERNEL LAUNCH

cuLaunchKernel will

1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
5. push the launch parameters into the command buffer
6. push the launch command into the command buffer
7. submit the command buffer to the device (on WDDM drivers this step may be deferred)
8. on WDDM, the kernel driver will page all required memory into device memory

The GPU will

1. verify the command
2. send the commands to the compute work distributor
3. dispatch the launch configuration and thread blocks to the SMs

When all thread blocks have completed the work distributor will flush the caches to honor the CUDA memory model and it will mark the kernel as completed so the next item in the stream can make forward progress.

The order that thread blocks are dispatched differs between architectures.

Compute capability 1.x devices store the kernel parameters in shared memory. Compute capability 2.0-3.5 devices store the kernel parameters in constant memory.

CUDA RUNTIME API

The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:

cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument

See NVIDIA Runtime API [Execution Control] for more details.

The <<<>>> CUDA language extension is the most common method used to launch a kernel.

During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.

For example

__global__ void kernel(float* buf, int j)
{
    // ...
}

kernel<<<blocks,threads,0,myStream>>>(d_buf,j);

generates

void __device_stub__Z6kernelPfi(float *__par0, int __par1)
{
    __cudaSetupArgSimple(__par0, 0U);
    __cudaSetupArgSimple(__par1, 4U);
    __cudaLaunch(((char *)((void ( *)(float *, int))kernel)));
}

You can inspect the generated files by adding --keep to your nvcc command line.
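
For example (the .cu file name here is just a placeholder):

nvcc --keep kernel.cu

The intermediate files, including the generated host stub code and the .ptx/.cubin, are then left in the working directory.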

cudaLaunch calls cuLaunchKernel.
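
In terms of the legacy runtime entry points listed earlier, the stub behaves roughly like the following sketch, using the blocks, threads, myStream, d_buf and j from the example above (the argument offsets are illustrative; the real stub goes through internal __cuda* wrappers):

cudaConfigureCall(blocks, threads, 0 /* dynamic shared mem */, myStream);  /* record the launch configuration */
cudaSetupArgument(&d_buf, sizeof(d_buf), 0);                               /* argument 0 at offset 0 */
cudaSetupArgument(&j, sizeof(j), sizeof(d_buf));                           /* argument 1 placed after it */
cudaLaunch((const void *)kernel);                                          /* launch, which ends up in cuLaunchKernel */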

CUDA DYNAMIC PARALLELISM

CUDA CDP (CUDA Dynamic Parallelism) works similarly to the CUDA Runtime API described above.

Greg Smith

By using <<<...>>>, you launch a number of threads on the GPU. These threads are grouped into blocks, and the blocks form a large grid. All of the threads execute the invoked kernel function's code.

In the kernel function, built-in variables like threadIdx and blockIdx let the code know which thread it is running as, so that it can do its scheduled part of the work.

edit

Basically, <<<...>>> simplifies the configuration procedure for launching a kernel. Without it, one might have to call 4-5 API functions for a single kernel launch, as in OpenCL, which uses only C99 syntax.

In fact, you could check the CUDA driver API; it provides all of those functions, so you do not need to use <<<>>>.

kangshiyin
  • I read that doc a long time ago, along with some other related ones. I think CUDA breaks the semantics of C and C++ in the *kernel* function part, which makes it hard to understand and easy to get confused by. So, you said *By using <<<...>>>, you are launching...*; yes, you're right, just like all those documents say, but explaining the CUDA execution configuration like that is just too bad. You only told me *what*, not even *how*, let alone *WHY*. – blackball Oct 08 '13 at 08:50
  • I've edited the answer. You may want to revise your question by adding the *WHY* and the *how* into it, to avoid misunderstanding of your question by **all** the audience. – kangshiyin Oct 08 '13 at 09:11

Basically, the GPU is divided into separate "device" GPUs (e.g. the GeForce 690 has 2) -> multiple SMs (streaming multiprocessors) -> multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical assignment independent of the hardware, but the total size of a block (x*y*z) is very important.

Threads in a block HAVE TO be on the same SM, to use its facilities for shared memory and synchronization. So you cannot have blocks with more threads than there are CUDA cores in an SM.

If we take a simple scenario with 16 SMs of 32 CUDA cores each, a 31x1x1 block size, and a 20x1x1 grid size, we forfeit at least 1/32 of the processing power of the card. Every time a block runs, an SM has only 31 of its 32 cores busy. Blocks will load to fill up the SMs, so 16 blocks will finish at roughly the same time, and as the first 4 SMs free up they will start processing the last 4 blocks (NOT necessarily blocks #17-20).

Comments and corrections are welcome.

Boyko Perfanov
  • I think the part I want to figure out is: how does the CUDA compiler parse the kernel code and dispatch the tasks to the threads on the device? In my original post, I *imagined* a way to self-explain the mechanism. – blackball Oct 09 '13 at 15:51