
I recently bumped into the problem illustrated in Uncorrectable ECC error. In short, from time to time I receive an Uncorrectable ECC error, and my dynamic parallelism code generates incorrect results. The most probable cause of the uncorrectable ECC error is a corrupted driver stack, which has also been indirectly confirmed by another user's experience (see the above post). I would now like to address the second issue, i.e., the algorithmic one. To this end, I'm dealing with the reproducer reported below which, since the original code generating incorrect results uses dynamic parallelism, uses this CUDA feature too.

I do not see any evident issue with this code. I think that the synchronization around the child kernel launch should be fine: the first __syncthreads() should not be necessary, and the cudaDeviceSynchronize() should ensure that all the memory writes of the child kernel have completed before the printf.

My question is: is this code wrong, or are the wrong results due to a non-programming issue?

My configuration: CUDA 5.0, Windows 7, 4-GPU system equipped with Kepler K20c, driver 327.23.

#include <stdio.h>
#include <conio.h>

#define K 6
#define BLOCK_SIZE 256

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getch(); exit(code); }
    }
}

int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

__global__ void child_kernel(double* P1) 
{
    int m = threadIdx.x;

    P1[m] = (double)m;
}

__global__ void parent_kernel(double* __restrict__ x, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i<M) {

        double* P1 = new double[13];

        dim3 dimBlock(2*K+1,1); dim3 dimGrid(1,1);

        __syncthreads();
        child_kernel<<<dimGrid,dimBlock>>>(P1);
        cudaDeviceSynchronize();

        for(int m=0; m<2*K+1; m++) printf("%f %f\n",P1[m],(double)m);

    }
}

int main() {

    const int M = 19000;

//gpuErrchk(cudaSetDevice(0));  

    double* x = (double*)malloc(M*sizeof(double));
    for (int i=0; i<M; i++) x[i] = (double)i;

    double* d_x; gpuErrchk(cudaMalloc((void**)&d_x,M*sizeof(double)));

    gpuErrchk(cudaMemcpy(d_x,x,M*sizeof(double),cudaMemcpyHostToDevice));

    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(iDivUp(M,BLOCK_SIZE));
    parent_kernel<<<dimGrid,dimBlock>>>(d_x,M);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    getch();

    return 0;
}
Vitality
  • When you say you have incorrect results, what are they? Are you sure that this isn't something as simple as a `printf` buffer overflow? – talonmies Jan 14 '14 at 10:37
  • @talonmies Typically, the first time I launch the code I receive "more or less" correct results, in that the two numbers in the same row mostly coincide. The subsequent times I launch it, an ever increasing number of `P1[m]` become different from `(double)m` with unreasonable `double` values. No, I'm not sure that this isn't a `printf` buffer overflow. To check it, I should shunt the `printf` and use the debugger or pass `P1` to the `main`, a test that I cannot do in the next hours since the machine is temporarily busy. – Vitality Jan 14 '14 at 11:00
  • @talonmies But could you kindly point me to some already known way to avoid the `printf` buffer overflow? That is anyhow something good to know. – Vitality Jan 14 '14 at 11:03
  • Have a look at `cudaDeviceSetLimit`. There are settings for the heap that your `new` allocates from and for the buffer that `printf` stores to. If they are not big enough, then the code won't work as expected (a minimal sketch follows these comments). – talonmies Jan 14 '14 at 11:09
  • Can you try a different GPU driver on that system? If so, please try [321.01](http://www.nvidia.com/download/driverResults.aspx/70859/en-us) or [321.10](http://www.nvidia.com/download/driverResults.aspx/71514/en-us). Although they are numerically lower than your driver, they are newer, and I believe they are worth a check. You should also reboot the machine during the install of each driver, and then repeat your test. I believe the Uncorrectable ECC error at least is due to a "non-programming" issue. – Robert Crovella Jan 14 '14 at 13:36
  • There's a huge amount of printout, so a [printf overflow](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#formatted-output) may be happening. Wouldn't it be easier to use unsigned long long instead of double, and just compare the quantities in-kernel rather than printing out everything and have to look through reams of printout for a mismatch? Also, you're doing no error checking on your child kernel launch. You might be exceeding the launch pending limit (2048), since each of your 19000 parent threads are launching a child. – Robert Crovella Jan 15 '14 at 23:49
  • Your in-kernel `new` may also fail, and you're not checking that either. 19000 x 13 x 8 bytes = ~2 MB, so that is probably not exceeding the limit, but if in doubt you can check the returned pointer against null, which signifies an allocation error (see the second sketch below). – Robert Crovella Jan 15 '14 at 23:52

1 Answer


I'm pretty sure you're exceeding the launch pending limit. It's nearly impossible to tell with your code as-is, but I've modified it and added error checking on the child kernel launch.

When I do that, I get launch errors, signified by a printout of !. Skipping the launch error cases, all of my in-kernel checking of P1[m] vs. m passes (I get no * printout at all).

#include <stdio.h>

#define K 6
#define BLOCK_SIZE 256

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { exit(code); }
    }
}

int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

__global__ void child_kernel(unsigned long long* P1)
{
    int m = threadIdx.x;

    P1[m] = (unsigned long long)m;
}

__global__ void parent_kernel(double* __restrict__ x, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i<M) {

        unsigned long long* P1 = new unsigned long long[13];

        dim3 dimBlock(2*K+1,1); dim3 dimGrid(1,1);

        __syncthreads();
        child_kernel<<<dimGrid,dimBlock>>>(P1);
        cudaDeviceSynchronize();
        // check whether the child kernel launch failed (e.g. pending launch limit exceeded)
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) printf("!");
        // otherwise verify the child's writes in-kernel; any mismatch prints a *
        else for(unsigned long long m=0; m<dimBlock.x; m++) if (P1[m] != m) printf("*");

    }
}

int main() {

    const int M = 19000;

//gpuErrchk(cudaSetDevice(0));

    double* x = (double*)malloc(M*sizeof(double));
    for (int i=0; i<M; i++) x[i] = (double)i;

    double* d_x; gpuErrchk(cudaMalloc((void**)&d_x,M*sizeof(double)));

    gpuErrchk(cudaMemcpy(d_x,x,M*sizeof(double),cudaMemcpyHostToDevice));

    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(iDivUp(M,BLOCK_SIZE));
    parent_kernel<<<dimGrid,dimBlock>>>(d_x,M);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    return 0;
}

Feel free to add further decoding of the err variable in the parent kernel to convince yourself that you are exceeding the launch pending limit. As another test, you can set M to 2048 instead of 19000 in your host code, and all the ! printouts go away (the default launch pending limit is 2048).
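
A minimal sketch of that extra decoding inside parent_kernel, together with the corresponding host-side limit increase; the value 20000 is taken from the follow-up comment below, and the message strings are illustrative.

    // inside parent_kernel, immediately after cudaDeviceSynchronize():
    cudaError_t err = cudaGetLastError();
    if (err == cudaErrorLaunchPendingCountExceeded) printf("pending launch count exceeded\n");
    else if (err != cudaSuccess) printf("child launch error %d\n", (int)err);

    // on the host, before launching parent_kernel, raise the limit above M = 19000
    // (20000 is the value reported to work in the comments below):
    gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 20000));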

As I've stated in the comments, I think the uncorrectable ECC error is a separate issue, and I suggest trying the driver 321.01 that I linked in the comments.

Robert Crovella
  • Thank you so much for your answer. At this moment I cannot test your code, since the machine is stuck in a simulation by a colleague with a deadline, and I will resume the problem on Monday. In the meantime, I have studied your answer to this [post](http://stackoverflow.com/questions/17902314/some-child-grids-not-being-executed-with-cuda-dynamic-parallelism), which was illuminating for me. I was wrongly convinced that the error checking on the parent kernel would also capture errors for child kernel launches; I will now add error checking for device-side kernel launches as well. – Vitality Jan 16 '14 at 06:31
  • I have finally tested your code and indeed I'm exceeding the launch pending limit. Specifically, if I change your decoding of the `err` variable to `if (err == cudaErrorLaunchPendingCountExceeded) printf("!");`, then I receive the `!`'s. Also, if I set `gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 20000));` at the beginning of the code, I no longer receive the `!`'s. Finally, my original code now works correctly with the pending launch count set to `20000`. Thanks. I'll accept your answer. – Vitality Jan 19 '14 at 21:31