I am testing dynamic parallelism with the following kernel, which finds the maximum value of an integer array in a divide-and-conquer fashion:
__global__ void getMax(int * arr, int ini, int fin, int * maxv) {
    if (ini >= fin) return;
    if (fin - ini == 1) {
        // Base case: a single element is its own maximum.
        *maxv = arr[ini];
        printf("Elem: %d (ini:%d)\n", *maxv, ini);
    } else {
        // Recurse on each half, then combine the two partial maxima.
        int *max1 = (int *) malloc(sizeof(int));
        int *max2 = (int *) malloc(sizeof(int));
        getMax<<<1,1>>>(arr, ini, (fin + ini) / 2, max1);
        getMax<<<1,1>>>(arr, (fin + ini) / 2, fin, max2);
        cudaDeviceSynchronize();   // wait for both child grids
        printf("Max1: %d, Max2: %d (ini:%d,fin:%d)\n",
               *max1, *max2, ini, fin);
        *maxv = max(*max1, *max2);
        free(max1);
        free(max2);
    }
}
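For context, here is a sketch of the host driver (the array contents and the use of rand() are illustrative; only N = 16 is taken from the printed run). Dynamic parallelism needs compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __global__ void getMax(int *arr, int ini, int fin, int *maxv);  // kernel above

int main() {
    const int N = 16;
    int h_arr[N], h_max;
    for (int i = 0; i < N; ++i) h_arr[i] = rand() % 16;  // illustrative fill

    int *d_arr, *d_max;
    cudaMalloc(&d_arr, N * sizeof(int));
    cudaMalloc(&d_max, sizeof(int));
    cudaMemcpy(d_arr, h_arr, N * sizeof(int), cudaMemcpyHostToDevice);

    getMax<<<1,1>>>(d_arr, 0, N, d_max);
    cudaDeviceSynchronize();                 // wait for the whole launch tree

    cudaMemcpy(&h_max, d_max, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Device max: %d\n", h_max);

    cudaFree(d_arr);
    cudaFree(d_max);
    return 0;
}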
The kernel is launched from the host as getMax<<<1,1>>>(d_arr, 0, N, d_max), where d_arr is the array, N its size, and d_max receives the maximum. Although I sometimes get the right output, the following run shows the properties I tend to see in the wrong ones:
10 6 8 7 14 4 0 4 9 8 6 4 8 10 5 1
Max1: 0, Max2: 0 (ini:0,fin:4)
Elem: 10 (ini:0)
Max1: 10, Max2: 0 (ini:0,fin:2)
Elem: 6 (ini:1)
Elem: 8 (ini:2)
Max1: 8, Max2: 0 (ini:2,fin:4)
Elem: 7 (ini:3)
Max1: 8, Max2: 8 (ini:4,fin:8)
Elem: 14 (ini:4)
Max1: 14, Max2: 6 (ini:4,fin:6)
Elem: 4 (ini:5)
Elem: 0 (ini:6)
Max1: 0, Max2: 8 (ini:6,fin:8)
Elem: 4 (ini:7)
Max1: 0, Max2: 8 (ini:0,fin:8)
Max1: 0, Max2: 4 (ini:8,fin:12)
Elem: 9 (ini:8)
Max1: 9, Max2: 4 (ini:8,fin:10)
Elem: 8 (ini:9)
Elem: 6 (ini:10)
Max1: 6, Max2: 4 (ini:10,fin:12)
Elem: 4 (ini:11)
Max1: 6, Max2: 6 (ini:12,fin:16)
Elem: 8 (ini:12)
Max1: 8, Max2: 8 (ini:12,fin:14)
Elem: 10 (ini:13)
Elem: 5 (ini:14)
Max1: 5, Max2: 6 (ini:14,fin:16)
Elem: 1 (ini:15)
Max1: 4, Max2: 6 (ini:8,fin:16)
Max1: 8, Max2: 6 (ini:0,fin:16)
Device max: 8
Host max: 14
As you can see, parent grids often print before their children have finished executing, even though cudaDeviceSynchronize() is being called. Worse, some children's values are never folded into the final result, so the GPU returns a wrong answer.
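One diagnostic I have not added yet (a sketch; whether these calls report anything here is exactly what I would be testing) is checking the device-runtime status codes around the child launches, since both cudaGetLastError() and the return value of cudaDeviceSynchronize() are available in device code. The else branch would become:

getMax<<<1,1>>>(arr, ini, (fin + ini) / 2, max1);
getMax<<<1,1>>>(arr, (fin + ini) / 2, fin, max2);
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("child launch failed: %d (ini:%d,fin:%d)\n", (int)launchErr, ini, fin);
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
    printf("sync failed: %d (ini:%d,fin:%d)\n", (int)syncErr, ini, fin);

If either call returns a non-success code, the parent would fall through and read uninitialized max1/max2, which would look exactly like the output above.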
I know that in-kernel malloc (which allocates from global memory) and dynamic parallelism itself are currently too slow for this code to show any speedup over the CPU. I would just like to understand why it is not synchronizing properly.
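On the in-kernel malloc point: one failure mode I am aware of (whether it applies here is an assumption) is that device-side malloc draws from a fixed-size device heap, 8 MB by default, and returns NULL when it is exhausted instead of aborting. A sketch of ruling that out:

// Device side: check the pointers instead of dereferencing blindly.
int *max1 = (int *) malloc(sizeof(int));
if (max1 == NULL) {
    printf("device malloc failed (ini:%d,fin:%d)\n", ini, fin);
    return;
}

// Host side, before the first launch: enlarge the device heap if needed.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);

For 16 elements the allocations are tiny, so I doubt the heap is the problem, but checking the pointers costs nothing.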