I am trying to use dynamic parallelism in CUDA. The parent kernel has a variable that needs to be passed to the child kernel for further computation. I have gone through the resources on the web here

and they mention that local variables cannot be passed to the child kernel, and describe the ways to pass variables. I have tried to pass the variable as follows:

#include <stdio.h>
#include <cuda.h>


__global__ void square(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if(N==10)
  {
  a[idx] = a[idx] * a[idx];
  }
}
// Kernel that executes on the CUDA device
__global__ void first(float *arr, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int n=N; // this value of n can be changed locally and need to be passed
  printf("%d\n",n);
  cudaMalloc((void **) &n, sizeof(int));

  square <<< 1, N >>> (arr, n);

}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
  const int N = 10;  // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:

  first <<< 1, 1 >>> (a_d, N);
  //cudaThreadSynchronize();
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
}

but the value is not passed from the parent kernel to the child kernel. How can I pass the value of a local variable? Is there any way to do so?

Malacu

1 Answer

This operation is not appropriate:

int n=N; // this value of n can be changed locally and need to be passed

cudaMalloc((void **) &n, sizeof(int)); // illegal

It is not appropriate in host code or in device code. n is an int variable, whereas cudaMalloc expects the address of a pointer in which to store the allocation. You are not supposed to assign a pointer to it. When you attempt to do so in a 64-bit environment, you are attempting to write a 64-bit pointer on top of a 32-bit int quantity. It will not work.

It's not clear why you would need it anyway. n is an integer parameter presumably specifying the size of your arr array of float. You don't need to allocate anything on top of it.

If you had run this code with cuda-memcheck, you would easily have discovered that error. You can also do proper CUDA error checking in device code in exactly the same fashion as you do it in host code.
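
As a rough illustration of that last point, here is a minimal sketch of error checking on both sides, with the stray cudaMalloc removed. The `CHECK` macro is a hypothetical helper and not part of the original code, and the bounds test in `square` is my own substitution for the `N==10` test. It assumes a device of compute capability 3.5 or newer and a build along the lines of `nvcc -rdc=true file.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original code): report a host-side
// CUDA error with file and line, then abort.
#define CHECK(call)                                          \
  do {                                                       \
    cudaError_t e_ = (call);                                 \
    if (e_ != cudaSuccess) {                                 \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
              cudaGetErrorString(e_));                       \
      exit(EXIT_FAILURE);                                    \
    }                                                        \
  } while (0)

__global__ void square(float *a, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) a[idx] = a[idx] * a[idx];  // bounds check instead of N==10
}

__global__ void first(float *arr, int N)
{
  int n = N;  // local variable, passed to the child by value
  square<<<1, n>>>(arr, n);
  // Device-side check: same runtime call as on the host.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("child launch failed: %s\n", cudaGetErrorString(err));
}

int main(void)
{
  const int N = 10;
  float a_h[N];
  float *a_d = NULL;
  for (int i = 0; i < N; i++) a_h[i] = (float)i;

  CHECK(cudaMalloc((void **)&a_d, sizeof(a_h)));
  CHECK(cudaMemcpy(a_d, a_h, sizeof(a_h), cudaMemcpyHostToDevice));

  first<<<1, 1>>>(a_d, N);
  CHECK(cudaGetLastError());       // launch errors of the parent kernel
  CHECK(cudaDeviceSynchronize());  // execution errors; parent finishes after its child

  CHECK(cudaMemcpy(a_h, a_d, sizeof(a_h), cudaMemcpyDeviceToHost));
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

  CHECK(cudaFree(a_d));
  return 0;
}
```

The host-side `cudaDeviceSynchronize()` is enough to surface execution errors from both kernels, because a parent grid is not considered complete until its child grids have completed.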

When I comment out that cudaMalloc line in the first kernel, your code runs correctly for me.

Robert Crovella
  • Yes, it is working without cudaMalloc. However, it was mentioned in the document that a local variable cannot be passed to a child kernel, but in the above example passing a local variable is working well. How is that possible? – Malacu Sep 09 '14 at 02:42
  • Local variables can be passed to the child kernel if they are passed *by value* as a kernel parameter. The document indicates that *pointers* to local variables should not be passed. Your code as written passes `n` by value in `first`. – Robert Crovella Sep 09 '14 at 13:35
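
To make the distinction in the last comment concrete, here is a small sketch contrasting the two cases. The kernel names and the `g_n` variable are hypothetical and chosen only for illustration. Copying the value into a `__device__` variable is one possible way to hand the child a pointer, not necessarily the one the documentation has in mind. As before, this assumes compute capability 3.5 or newer and a build with `-rdc=true` and `-lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical global-memory variable, needed only when the child
// must receive a pointer rather than a value.
__device__ int g_n;

__global__ void child_by_value(int n)
{
  printf("by value: %d\n", n);
}

__global__ void child_by_pointer(const int *n)
{
  printf("by pointer: %d\n", *n);
}

__global__ void parent(int N)
{
  int n = N + 1;  // local variable, modified in the parent

  // Fine: the value of n is copied into the child's launch parameters.
  child_by_value<<<1, 1>>>(n);

  // Not allowed: &n points into the parent thread's local memory.
  // child_by_pointer<<<1, 1>>>(&n);

  // Copy the value to global memory first, then pass that address.
  g_n = n;
  child_by_pointer<<<1, 1>>>(&g_n);
}

int main(void)
{
  parent<<<1, 1>>>(10);
  cudaDeviceSynchronize();  // wait for the parent and both children
  return 0;
}
```

With the argument 10 used in `main`, both children should see the value 11. The write to `g_n` happens before the second launch, so the child grid observes it.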