Nested parallelism in Cuda

Question

In the following code, I want to use nested parallelism to compute 10 times an element of an array. I use this simple example to learn more about dynamic parallelism in Cuda. The way the code works is that for each element of a parentArray, there is another kernel saving this element at a position of a childArray (0 to 9). So for each element of the parentArray, I have another array with 10 elements, each being equal to the element of the parentArray. In the end, I compute the sum of all childArrays and save the result in the parentArray.

The result therefore should be:

Element 0 of parentArray, Result = 0
Element 1 of parentArray, Result = 10
Element 2 of parentArray, Result = 20 and so on

Currently, the code compiles but doesn't give the expected results. What is wrong with the current code?

The function to compute the sum of the elements

__device__ double summe(double *arr, int size)
{
  double result = 0.0;
  for(int i = 0; i < size; i++)
  {
    result += arr[i];
  }
  return result;
}

The function called from childKernel

__device__ double getElement(double arrElement)
{
  return arrElement;
}

The array in which results are stored

__device__ double childArr[10];

The childKernel

__global__ void childKernel(double *arr, double arrElement,int N)
{
  int cidx = blockIdx.x * blockDim.x + threadIdx.x;
  if (cidx < N)
  {
    arr[cidx] = getElement(arrElement);
  }
}

The parentKernel

__global__ void parentKernel(double *parentArray, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
  {
    childKernel<<<1,10>>>(childArr,parentArray[idx],N);
    __syncthreads();
    parentArray[idx] = summe(childArr,10);

  }

}

The main part

 int main(void)
    {

      double *host_array;
      double *device_array;

      // Number of elements in arrays
      const int N_array = 10;

      // size of array
      const size_t size_array = N_array * sizeof(double);

      // Allocate array on host
      host_array = (double *)malloc(size_array);

      // Allocate array on device
      CUDA_CALL(cudaMalloc((void **) &device_array, size_array));

      // Initialize host array
      for (int i=0; i<N_array; i++)
      {
        host_array[i] = (double)i;
      }

      // and copy it to CUDA device
      CUDA_CALL(cudaMemcpy(device_array, host_array, size_array, cudaMemcpyHostToDevice));

      // Do calculation on device:
      int block_size = 4;
      // if N = 10, then n_blocks = 3
      int n_blocks = N_array/block_size + (N_array % block_size == 0 ? 0:1);

      parentKernel<<<n_blocks, block_size>>>(device_array,N_array);

      // Retrieve result from device and store it in host array
      CUDA_CALL(cudaMemcpy(host_array, device_array, sizeof(double)*N_array, cudaMemcpyDeviceToHost));

  // Print results
  for (int i=0; i<N_array; i++)
  {
    printf("Element %d of parentArray, Result = %f\n", i, host_array[i]);
  }

  // Cleanup
  free(host_array);
  CUDA_CALL(cudaFree(device_array));

}

The result I get is:

0 52.000000
1 52.000000
2 52.000000
3 52.000000
4 48.000000
5 48.000000
6 48.000000
7 48.000000
8 48.000000
9 48.000000

I use Cuda 6.5
NVCCFLAGS= -arch=sm_35 -rdc=true -G -O3 --compiler-options -Wall

/opt/cuda-6.5/bin/nvcc -V 

nvcc: NVIDIA (R) Cuda compiler driver    
Copyright (c) 2005-2014 NVIDIA Corporation    
Built on Thu_Jul_17_21:41:27_CDT_2014   
Cuda compilation tools, release 6.5, V6.5.12

Please edit your GPU details and the command used to compile the code into your question — talonmies, Nov 09 '15 at 18:09

Robert Crovella · Accepted Answer · 2015-11-09T18:43:04.103

You are launching 10 kernels at this point (each child kernel also having 10 threads), one from each of the 10 active parent kernel threads:

childKernel<<<1,10>>>(childArr,parentArray[idx],N);

Those 10 kernels will run in any order, completely asynchronously to each other. Furthermore each one of those 10 kernels is attempting to write values into the same 10 locations in childArr. So this is a race condition. The final results in childArr at this point:

__syncthreads();

will be unpredictable.

One possible method to avoid the race condition would be to have each child kernel write into a separate portion of childArr.

Another problem is the use of __syncthreads() instead of cudaDeviceSynchronize() as a barrier in the kernel. Kernel launches, whether from host or device code, are asynchronous, and __syncthreads() does not guarantee that asynchronously-launched prior work is complete. cudaDeviceSynchronize() causes the calling thread to pause until all previous kernels launched by that thread are complete. (and see note below)

With those two changes, your code can produce the output you are expecting:

$ cat t11.cu
#include <stdio.h>
#define CUDA_CALL(x) x
#define MY_M 10
#define MY_N 10

__device__ double childArr[MY_M*MY_N];

__device__ double summe(double *arr, int size)
{
  double result = 0.0;
  for(int i = 0; i < size; i++)
  {
    result += arr[i];
  }
  return result;
}

__device__ double getElement(double arrElement)
{
  return arrElement;
}

__global__ void childKernel(double *arr, double arrElement,int N)
{
  int cidx = blockIdx.x * blockDim.x + threadIdx.x;
  if (cidx < N)
  {
    arr[cidx] = getElement(arrElement);
  }
}

__global__ void parentKernel(double *parentArray, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
  {
    childKernel<<<1,MY_M>>>(childArr+MY_M*idx,parentArray[idx],N);
    cudaDeviceSynchronize();
    parentArray[idx] = summe(childArr+MY_M*idx,MY_M);

  }

}

int main(void)
    {

      double *host_array;
      double *device_array;

      // Number of elements in arrays
      const int N_array = MY_N;

      // size of array
      const size_t size_array = N_array * sizeof(double);

      // Allocate array on host
      host_array = (double *)malloc(size_array);

      // Allocate array on device
      CUDA_CALL(cudaMalloc((void **) &device_array, size_array));

      // Initialize host array
      for (int i=0; i<N_array; i++)
      {
        host_array[i] = (double)i;
      }

      // and copy it to CUDA device
      CUDA_CALL(cudaMemcpy(device_array, host_array, size_array, cudaMemcpyHostToDevice));

      // Do calculation on device:
      int block_size = 4;
      // if N = 10, then n_blocks = 3
      int n_blocks = N_array/block_size + (N_array % block_size == 0 ? 0:1);

      parentKernel<<<n_blocks, block_size>>>(device_array,N_array);

      // Retrieve result from device and store it in host array
      CUDA_CALL(cudaMemcpy(host_array, device_array, sizeof(double)*N_array, cudaMemcpyDeviceToHost));

  // Print results
  for (int i=0; i<N_array; i++)
  {
    printf("Element %d of parentArray, Result = %f\n", i, host_array[i]);
  }

  // Cleanup
  free(host_array);
  CUDA_CALL(cudaFree(device_array));

}


$ nvcc -arch=sm_52 -rdc=true -o t11 t11.cu -lcudadevrt
$ cuda-memcheck ./t11
========= CUDA-MEMCHECK
Element 0 of parentArray, Result = 0.000000
Element 1 of parentArray, Result = 10.000000
Element 2 of parentArray, Result = 20.000000
Element 3 of parentArray, Result = 30.000000
Element 4 of parentArray, Result = 40.000000
Element 5 of parentArray, Result = 50.000000
Element 6 of parentArray, Result = 60.000000
Element 7 of parentArray, Result = 70.000000
Element 8 of parentArray, Result = 80.000000
Element 9 of parentArray, Result = 90.000000
========= ERROR SUMMARY: 0 errors
$

Note that CDP codes should usually be compiled:

for a compute capability of 3.5 or higher
with the -rdc=true switch (or equivalent sequence, e.g. -dc followed by device link)
with the -lcudadevrt switch, to pick up the device runtime library.

Note: In fact, cudaDeviceSynchronize() called from a parent thread, after a previous child kernel call, will pause that thread until all previously launched kernels from any thread in the block are complete. (documentation) However, since threads in a block are not guaranteed to execute in lock-step with each other, it may not be obvious what kernels in other threads have launched at a particular point. Therefore, correct usage might involve __syncthreads() (to guarantee that child kernels in other threads have launched) followed immediately by cudaDeviceSynchronize() to guarantee that those child kernels have completed, if that is the desired behavior. However, in this particular case, a given parent thread's result does not depend on completion of other parent threads child kernels, therefore we can omit the __syncthreads() in this case, just replacing it with cudaDeviceSynchronize().

If I would like to work with random numbers inside the childKernel (the function getElement in this case), should I then call a setup_kernel<<<>>> for the curandStates inside the parentKernel? Thanks! — beginneR, Nov 23 '15 at 13:36
You could. Or you could call a setup kernel on the host prior to the launch of your parent/child kernel. Or you could just setup in the child kernel as demonstrated [here](http://stackoverflow.com/questions/33848400/do-while-dont-work-inside-cuda-kernel/33848872#33848872). — Robert Crovella, Nov 23 '15 at 14:32
Ok thanks, but [here](http://stackoverflow.com/questions/33846594/runif-function-in-cuda) it says that one shouldn't create a state inside a kernel due to performance loss. It would be nice to see some code for my question here where getElement() generates a single random number. Maybe I'll post another question concerning this? — beginneR, Nov 23 '15 at 15:51
It says "creating and initializing of a **new** local state" will kill performance. State has to be created once per thread, that is unavoidable. What you don't want to do is create it over and over for a single thread. In fact I pointed that out in my answer that I linked [here](http://stackoverflow.com/questions/33848400/do-while-dont-work-inside-cuda-kernel/33848872#33848872). Yes, I'd suggest creating a new question if you have new questions. — Robert Crovella, Nov 23 '15 at 16:01
To be clear, I suggested creating a **new** question, not editing this one. — Robert Crovella, Nov 23 '15 at 16:17
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95952/discussion-between-beginner-and-robert-crovella). — beginneR, Nov 23 '15 at 16:18

Nested parallelism in Cuda

1 Answers1