
The original problem was launching more threads per block than the hardware allows, like this:

someKernel<<<1, 1025>>>( ... );

and not detecting the error, as I did not know how to detect kernel launch errors. This is explained well in talonmies' answer to this question:

What is the canonical way to check for errors using the CUDA runtime API?
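
For reference, here is a minimal sketch (my own, not from the linked answer) of how the per-block limits could be queried at runtime with cudaGetDeviceProperties, so a launch configuration can be validated before the call; device 0 is assumed:

#include "cuda_runtime.h"
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    // Query the properties of device 0; maxThreadsPerBlock is the limit
    // that the <<<1, 1025>>> launch above exceeds on current hardware.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return 1;

    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}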

Instead of modifying the code I originally presented, I wrote my own for conciseness:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t cudaError, const char *file, int line, bool abort=true)
{
   // Report the CUDA error string together with the source location.
   // The abort parameter is unused here; see the question about exit() below.
   if (cudaError != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(cudaError), file, line);
   }
}

__global__ void addKernel(const int *dev_a, const int *dev_b, int *dev_c)
{
    int i = threadIdx.x;
    // Guard so that oversized launches never read or write out of bounds.
    if ( i < 5 )
        dev_c[i] = dev_a[i] + dev_b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    int *dev_a(nullptr), *dev_b(nullptr), *dev_c(nullptr);

    gpuErrchk( cudaMalloc((void**)&dev_a, arraySize * sizeof(int)) );
    gpuErrchk( cudaMalloc((void**)&dev_b, arraySize * sizeof(int)) );
    gpuErrchk( cudaMalloc((void**)&dev_c, arraySize * sizeof(int)) );

    gpuErrchk( cudaMemcpy(dev_a, a, arraySize * sizeof(int), cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpy(dev_b, b, arraySize * sizeof(int), cudaMemcpyHostToDevice) );

    // Both configurations deliberately exceed the 1024-threads-per-block
    // limit: 1025 threads in 1D, 32 * 33 = 1056 threads in 2D.
    const int testMax1D = 1025;
    dim3 testMax2D ( 32, 33 );

    addKernel<<<1, testMax2D>>>( dev_a, dev_b, dev_c );
    gpuErrchk( cudaPeekAtLastError() );   // pre-launch (configuration) errors
    gpuErrchk( cudaDeviceSynchronize() ); // post-launch (execution) errors

    gpuErrchk( cudaMemcpy( c, dev_c, arraySize * sizeof(int), cudaMemcpyDeviceToHost) );

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}

I now get correct error reports. Thank you for your patience.

I don't understand this call in the gpuAssert function, so I omitted it:

if (abort) exit(code);

Is exit a custom-written function, or is it something I missed?

  • You need to [check the CUDA error](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) after each kernel launch. When you exceed the maximum number of blocks or threads in a launch configuration, you'll get a "CUDA unspecified launch error" or somesuch. – Jared Hoberock Apr 23 '13 at 23:30
  • @JaredHoberock I added the whole code. I (or rather nVidia's example developers) seem to be checking errors at every possible step. Am I missing one? – Dan S. Apr 23 '13 at 23:47
  • Yes. You don't understand how to check for kernel launch errors. Read the link Jared suggested carefully. – Robert Crovella Apr 24 '13 at 00:01
  • I'm not sure where you got this code from or why you attribute it to NVIDIA. It may have come from NVIDIA, I don't know. But the NVIDIA vectorAdd sample is [here](http://docs.nvidia.com/cuda/cuda-samples/index.html#vector-addition), and it contains proper kernel error checking. – Robert Crovella Apr 24 '13 at 00:54
  • @DanS: Please read the second half of [this answer](http://stackoverflow.com/a/14038590/681865) through until you understand it, then modify this code based on the suggestion in the linked answer, and then run your code. – talonmies Apr 24 '13 at 05:46
  • @RobertCrovella If you install the CUDA 5.0 toolkit and then start a new CUDA project in Visual Studio 2010, you will get this code. I assumed (falsely?) it was written by nVidia developers. Who do you think wrote it? – Dan S. Apr 24 '13 at 12:40
  • Thanks for identifying the source of the code you originally posted. I wasn't aware of it. I don't think I've ever actually created a new CUDA project in VS, I've always copied from an existing (sample/SDK) project. But I'm going to try it now. – Robert Crovella Apr 24 '13 at 16:21

1 Answer


There are two classes of errors that can occur with kernel launches, and they need to be checked for in separate steps, following a particular order.

The first class of errors is reported synchronously when a kernel call is made and prior to the kernel actually being launched on the device, i.e. these are "pre-launch" errors. These errors typically involve requesting more of a particular resource than is available (e.g. too much shared memory, too many threads). Check for these by calling cudaGetLastError() immediately after a kernel call.

The second class consists of errors that occur at some point after the kernel has been launched on the device (e.g. a memory access violation, or a timeout of the watchdog timer). These are "post-launch" errors. That they are reported some time after the kernel call is a natural consequence of kernel launches occurring asynchronously. They are reported at the next opportunity, which is usually the next synchronous API call. Check for these by calling cudaDeviceSynchronize() and examining its status return.
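
For illustration, here is a minimal self-contained sketch of both checks applied in that order; the 1025-thread launch deliberately provokes a pre-launch error on devices limited to 1024 threads per block:

#include "cuda_runtime.h"
#include <stdio.h>

__global__ void emptyKernel() { }

int main()
{
    // Deliberately over-subscribe the block to trigger a pre-launch error.
    emptyKernel<<<1, 1025>>>();

    // Step 1: pre-launch errors, reported synchronously at the call site.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch error: %s\n", cudaGetErrorString(err));

    // Step 2: post-launch errors, surfaced by the next synchronous call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "execution error: %s\n", cudaGetErrorString(err));

    return 0;
}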

The posted code is missing a check for errors of the first class.

Robert Crovella