cudaMalloc an array within a structure creates an "illegal memory access"

Question

Below is a simplified version of a problem that I am trying to solve. Both code snippets compile, but #2 throws an "illegal memory access". Basically, if an array is encapsulated in a structure, allocating it with cudaMalloc through a pointer to that structure and then passing that pointer to a kernel creates all kinds of problems, at least the way I do it. I am pretty sure this is because dum in the code below points to host memory, and so is not accessible inside the kernel. The problem is that I don't know how to create a device version of dum. For example, using cudaMalloc( (void**)&dum , sizeof(dummy) * 1 ) instead of the new dummy syntax below does not solve the problem. I think I am getting confused by the double pointer used by cudaMalloc.

Of course, it may seem silly in this example to encapsulate an array of doubles in a structure, but in the actual code I really need to do this.

struct dummy
{
  double *arr;
};



void allocate( dummy *dum , int n )
{
  cudaMalloc( (double**)&(dum->arr) , sizeof(double) * n );
}



__global__ void test( double val , dummy *dum , int n )
{
  printf( "test\n" );
  for( int ii = 0 ; ii < n ; ii++ )
    dum->arr[ii] = val;
}


__global__ void test2( double val , double *arr , int n )
{
  printf( "test\n" );
  for( int ii = 0 ; ii < n ; ii++ )
    arr[ii] = val;
}


int main()
{

  int n = 10;

  dummy *dum = new dummy;


  /* CODE 1: the piece of code below works */
  double *p;
  gpu_err_chk( cudaMalloc( &p , sizeof(double) * n ) );
  test2<<< 1 , 1 >>>( 123.0 , p , n );
  gpu_err_chk( cudaDeviceSynchronize() );


  /* CODE 2: the piece of code below does not... */
  allocate( dum , n );
  test<<< 1 , 1 >>>( 123.0 , dum , n );
  gpu_err_chk( cudaDeviceSynchronize() );

  return 0;

}

`dum` is a host pointer. That is, it points to a location in host memory. I know this because it was allocated with `new`, a host memory allocator. If you pass a host pointer to device code, and attempt to do anything (at all) with it, you are going to run into trouble. This is one of the most commonly asked questions about CUDA here on the `cuda` SO tag. There are many questions like it. [here](https://stackoverflow.com/questions/15431365/cudamemcpy-segmentation-fault/15435592#15435592) is one. — Robert Crovella, Jan 22 '20 at 23:47
Or just pass the structure by value to the kernel, not try and pass a pointer to the structure — talonmies, Jan 23 '20 at 07:03
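
A minimal sketch of the pass-by-value suggestion, assuming the structure only needs to carry the device pointer into the kernel. The `gpu_err_chk` macro is not defined anywhere in the post, so the definition below is a stand-in written for this example, not the asker's actual helper. The structure itself stays on the host; only its `arr` member holds a device address, and the kernel receives a copy of the structure as a launch argument.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the undefined gpu_err_chk from the post (an assumption).
#define gpu_err_chk(call)                                        \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n",                \
              cudaGetErrorString(err_), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

struct dummy
{
  double *arr;
};

// The structure arrives by value, so the kernel works on its own copy
// and never dereferences a host address.
__global__ void test( double val , dummy dum , int n )
{
  for( int ii = 0 ; ii < n ; ii++ )
    dum.arr[ii] = val;
}

int main()
{
  const int n = 10;

  dummy dum;                                                   // host object
  gpu_err_chk( cudaMalloc( &dum.arr , sizeof(double) * n ) );  // arr holds a device address

  test<<< 1 , 1 >>>( 123.0 , dum , n );
  gpu_err_chk( cudaGetLastError() );
  gpu_err_chk( cudaDeviceSynchronize() );

  // Copy the array back to the host to inspect it.
  double out[n];
  gpu_err_chk( cudaMemcpy( out , dum.arr , sizeof(double) * n , cudaMemcpyDeviceToHost ) );
  for( int ii = 0 ; ii < n ; ii++ )
    printf( "out[%d] = %f\n" , ii , out[ii] );

  gpu_err_chk( cudaFree( dum.arr ) );
  return 0;
}
```

With this layout there is no device copy of the structure to manage, so the only device allocation to free is `dum.arr`.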

Bastien · Answer 1 · 2020-01-23T18:01:30.783

After digging through some examples in previous posts by Robert, I was able to rewrite the code so that it works:

struct dummy
{
  double *arr;
};



__global__ void test( dummy *dum , int n )
{
  printf( "test\n" );
  for( int ii = 0 ; ii < n ; ii++ )
    printf( "dum->arr[%d] = %f\n" , ii , dum->arr[ii] );

}



int main()
{

  int n = 10;

  dummy *dum_d , *dum_h;

  srand( time(0) );

  dum_h  = new dummy;
  dum_h->arr = new double[n];
  for( int ii = 0 ; ii < n ; ii++  ){
    dum_h->arr[ii]  = double( rand() ) / RAND_MAX;
    printf( "reference data %d = %f\n" , ii , dum_h->arr[ii] );
  }

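  // Allocate the device structure and copy the host structure into it.
  // After this copy, the arr member on the device still holds a host address.
  cudaMalloc( &dum_d , sizeof(dummy) * 1 );
  cudaMemcpy( dum_d , dum_h , sizeof(dummy) * 1 , cudaMemcpyHostToDevice );

  // Allocate the device array, then overwrite the arr member of the device
  // structure with the device address stored in tmp.
  double *tmp;
  cudaMalloc( &tmp , sizeof(double) * n );
  cudaMemcpy( &( dum_d->arr ) , &tmp , sizeof(double*) , cudaMemcpyHostToDevice );

  // Copy the array contents from the host array to the device array.
  cudaMemcpy( tmp , dum_h->arr , sizeof(double) * n , cudaMemcpyHostToDevice );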

  delete [] dum_h->arr;
  delete dum_h;

  test<<< 1 , 1 >>>( dum_d , n );
  gpu_err_chk( cudaDeviceSynchronize() );

  cudaFree( tmp );
  cudaFree( dum_d );

  return 0;

}

However, I am still confused about why this works. Does anyone have a visual diagram of what's going on? I am getting lost among the different pointers...
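
One way to keep the pointers straight, sketched under the assumption that the goal is simply a device structure whose `arr` member holds a device address: build a staging copy of the structure on the host, store the device array address in it, and copy the whole structure to the device once. The names `staging`, `arr_d` and `data_h` are made up for this example, and the `gpu_err_chk` definition is a stand-in because the post does not show the real one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the undefined gpu_err_chk from the post (an assumption).
#define gpu_err_chk(call)                                        \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n",                \
              cudaGetErrorString(err_), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

struct dummy
{
  double *arr;
};

__global__ void test( dummy *dum , int n )
{
  for( int ii = 0 ; ii < n ; ii++ )
    printf( "dum->arr[%d] = %f\n" , ii , dum->arr[ii] );
}

int main()
{
  const int n = 10;

  // Host data to upload.
  double data_h[n];
  for( int ii = 0 ; ii < n ; ii++ )
    data_h[ii] = 0.5 * ii;

  // 1. Device array: arr_d is a host variable holding a device address.
  double *arr_d;
  gpu_err_chk( cudaMalloc( &arr_d , sizeof(double) * n ) );
  gpu_err_chk( cudaMemcpy( arr_d , data_h , sizeof(double) * n , cudaMemcpyHostToDevice ) );

  // 2. Host staging structure whose member stores that device address.
  dummy staging;
  staging.arr = arr_d;

  // 3. Device structure: a byte-for-byte copy of the staging structure.
  dummy *dum_d;
  gpu_err_chk( cudaMalloc( &dum_d , sizeof(dummy) ) );
  gpu_err_chk( cudaMemcpy( dum_d , &staging , sizeof(dummy) , cudaMemcpyHostToDevice ) );

  test<<< 1 , 1 >>>( dum_d , n );
  gpu_err_chk( cudaGetLastError() );
  gpu_err_chk( cudaDeviceSynchronize() );

  gpu_err_chk( cudaFree( arr_d ) );
  gpu_err_chk( cudaFree( dum_d ) );
  return 0;
}
```

There are two device allocations (the structure and the array) and two host variables that remember their addresses (`dum_d` and `arr_d`). The kernel reaches the array only through the address stored inside the device structure.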

Moreover, there is one thing that really blows my mind: I can free tmp right before the kernel launch and the code still works, i.e.:

  cudaFree( tmp );

  test<<< 1 , 1 >>>( dum_d , n );
  gpu_err_chk( cudaDeviceSynchronize() );

How is this the case? In my mind (clearly wrong), the device array containing the random values is gone...

Another point of confusion is that I can't free dum_d->arr directly with cudaFree(dum_d->arr); this throws a segmentation fault.
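
A hedged sketch of what is probably going on here. `dum_d` holds a device address, so evaluating `dum_d->arr` in host code reads device memory from the host, which is the likely source of the segmentation fault. The address stored in that member can instead be copied back to the host with cudaMemcpy and then passed to cudaFree. As for freeing the array before the launch: cudaFree does not promise to clear the memory, so a kernel reading it afterwards is reading through a dangling pointer, which is undefined behaviour that may appear to work. The names `staging` and `copy_h` are made up for this example, and `gpu_err_chk` is again a stand-in definition.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the undefined gpu_err_chk from the post (an assumption).
#define gpu_err_chk(call)                                        \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "CUDA error: %s (%s:%d)\n",                \
              cudaGetErrorString(err_), __FILE__, __LINE__);     \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

struct dummy
{
  double *arr;
};

__global__ void test( dummy *dum , int n )
{
  for( int ii = 0 ; ii < n ; ii++ )
    printf( "dum->arr[%d] = %f\n" , ii , dum->arr[ii] );
}

int main()
{
  const int n = 10;

  dummy *dum_d;
  gpu_err_chk( cudaMalloc( &dum_d , sizeof(dummy) ) );

  {
    // Build the device structure through a host staging copy.
    dummy staging;
    gpu_err_chk( cudaMalloc( &staging.arr , sizeof(double) * n ) );
    gpu_err_chk( cudaMemset( staging.arr , 0 , sizeof(double) * n ) );  // all doubles become 0.0
    gpu_err_chk( cudaMemcpy( dum_d , &staging , sizeof(dummy) , cudaMemcpyHostToDevice ) );
  } // staging is gone; only the device structure remembers the array address

  test<<< 1 , 1 >>>( dum_d , n );
  gpu_err_chk( cudaGetLastError() );
  gpu_err_chk( cudaDeviceSynchronize() );  // free only after the kernel has finished

  // Copy the structure back to the host to recover the array address,
  // instead of dereferencing dum_d in host code.
  dummy copy_h;
  gpu_err_chk( cudaMemcpy( &copy_h , dum_d , sizeof(dummy) , cudaMemcpyDeviceToHost ) );

  gpu_err_chk( cudaFree( copy_h.arr ) );
  gpu_err_chk( cudaFree( dum_d ) );
  return 0;
}
```

If the host still holds the original pointer (such as `tmp` in the code above), passing that to cudaFree is equivalent, since it is the same address as the one stored in the device structure.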

Please refrain from asking questions in answers. This is not how Stack Overflow works — talonmies, Jan 23 '20 at 18:32
Sorry... You're right, I should have left a comment. Any idea about the questions? — Bastien, Jan 23 '20 at 23:08
