I am testing dynamic parallelism with the following kernel, which finds the maximum value of an integer array in a divide-and-conquer fashion:
__global__ void getMax(int * arr, int ini, int fin, int * maxv) {
    if (ini >= fin) return;
    if (fin - ini == 1) {
        // Base case: a single element is its own maximum.
        *maxv = arr[ini];
        printf("Elem: %d (ini:%d)\n", *maxv, ini);
    } else {
        // Recurse on each half, then combine the two partial maxima.
        int *max1 = (int *) malloc(sizeof(int));
        int *max2 = (int *) malloc(sizeof(int));
        getMax<<<1,1>>>(arr, ini, (fin + ini) / 2, max1);
        getMax<<<1,1>>>(arr, (fin + ini) / 2, fin, max2);
        cudaDeviceSynchronize();   // wait for both child grids
        printf("Max1: %d, Max2: %d (ini:%d,fin:%d)\n",
               *max1, *max2, ini, fin);
        *maxv = max(*max1, *max2);
        free(max1);
        free(max2);
    }
}
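For context, here is a sketch of the host driver (the array contents and the use of rand() are illustrative; only N = 16 is taken from the printed run). Dynamic parallelism needs compute capability 3.5+ and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __global__ void getMax(int *arr, int ini, int fin, int *maxv);  // kernel above

int main() {
    const int N = 16;
    int h_arr[N], h_max;
    for (int i = 0; i < N; ++i) h_arr[i] = rand() % 16;  // illustrative fill

    int *d_arr, *d_max;
    cudaMalloc(&d_arr, N * sizeof(int));
    cudaMalloc(&d_max, sizeof(int));
    cudaMemcpy(d_arr, h_arr, N * sizeof(int), cudaMemcpyHostToDevice);

    getMax<<<1,1>>>(d_arr, 0, N, d_max);
    cudaDeviceSynchronize();                 // wait for the whole launch tree

    cudaMemcpy(&h_max, d_max, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Device max: %d\n", h_max);

    cudaFree(d_arr);
    cudaFree(d_max);
    return 0;
}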
The kernel is launched from the host as getMax<<<1,1>>>(d_arr, 0, N, d_max), where d_arr is the array, N its size, and d_max receives the maximum. Although I sometimes get the right output, the following run shows the properties I tend to see in the wrong ones:
10 6 8 7 14 4 0 4 9 8 6 4 8 10 5 1
Max1: 0, Max2: 0 (ini:0,fin:4)
Elem: 10 (ini:0)
Max1: 10, Max2: 0 (ini:0,fin:2)
Elem: 6 (ini:1)
Elem: 8 (ini:2)
Max1: 8, Max2: 0 (ini:2,fin:4)
Elem: 7 (ini:3)
Max1: 8, Max2: 8 (ini:4,fin:8)
Elem: 14 (ini:4)
Max1: 14, Max2: 6 (ini:4,fin:6)
Elem: 4 (ini:5)
Elem: 0 (ini:6)
Max1: 0, Max2: 8 (ini:6,fin:8)
Elem: 4 (ini:7)
Max1: 0, Max2: 8 (ini:0,fin:8)
Max1: 0, Max2: 4 (ini:8,fin:12)
Elem: 9 (ini:8)
Max1: 9, Max2: 4 (ini:8,fin:10)
Elem: 8 (ini:9)
Elem: 6 (ini:10)
Max1: 6, Max2: 4 (ini:10,fin:12)
Elem: 4 (ini:11)
Max1: 6, Max2: 6 (ini:12,fin:16)
Elem: 8 (ini:12)
Max1: 8, Max2: 8 (ini:12,fin:14)
Elem: 10 (ini:13)
Elem: 5 (ini:14)
Max1: 5, Max2: 6 (ini:14,fin:16)
Elem: 1 (ini:15)
Max1: 4, Max2: 6 (ini:8,fin:16)
Max1: 8, Max2: 6 (ini:0,fin:16)
Device max: 8
Host max: 14
As you can see, parent grids often print before their children have finished executing, even though cudaDeviceSynchronize() is being called. Worse, some children's values are never folded into the final result, so the GPU returns a wrong answer.
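One diagnostic I have not added yet (a sketch; whether these calls report anything here is exactly what I would be testing) is checking the device-runtime status codes around the child launches, since both cudaGetLastError() and the return value of cudaDeviceSynchronize() are available in device code. The else branch would become:

getMax<<<1,1>>>(arr, ini, (fin + ini) / 2, max1);
getMax<<<1,1>>>(arr, (fin + ini) / 2, fin, max2);
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    printf("child launch failed: %d (ini:%d,fin:%d)\n", (int)launchErr, ini, fin);
cudaError_t syncErr = cudaDeviceSynchronize();
if (syncErr != cudaSuccess)
    printf("sync failed: %d (ini:%d,fin:%d)\n", (int)syncErr, ini, fin);

If either call returns a non-success code, the parent would fall through and read uninitialized max1/max2, which would look exactly like the output above.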
I know that in-kernel malloc (which allocates from global memory) and dynamic parallelism itself are currently too slow for this code to show any speedup over the CPU. I would just like to understand why it is not synchronizing properly.
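On the in-kernel malloc point: one failure mode I am aware of (whether it applies here is an assumption) is that device-side malloc draws from a fixed-size device heap, 8 MB by default, and returns NULL when it is exhausted instead of aborting. A sketch of ruling that out:

// Device side: check the pointers instead of dereferencing blindly.
int *max1 = (int *) malloc(sizeof(int));
if (max1 == NULL) {
    printf("device malloc failed (ini:%d,fin:%d)\n", ini, fin);
    return;
}

// Host side, before the first launch: enlarge the device heap if needed.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);

For 16 elements the allocations are tiny, so I doubt the heap is the problem, but checking the pointers costs nothing.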