CUDA/Thrust double pointer problem (vector of pointers)

Question

Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to dereference a double pointer in a CUDA kernel; the pointer comes from a thrust::device_vector of type Object* (a vector of pointers) that was filled from the host. When compiled with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.

I have read the NVIDIA forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending it to the kernel. Has anyone found a solution to this problem? The required code is below, thanks in advance!

--------------------------
        main.cpp
--------------------------

Sphere * parseSphere(int i)
{
  Sphere * s = new Sphere();
  s->a = 1+i;
  s->b = 2+i;
  s->c = 3+i;
  return s;
}

int main( int argc, char** argv ) {

  int i;
  thrust::host_vector<Sphere *> spheres_h;
  thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);

  //initialize spheres_h
  for(i=0;i<NUM_OBJECTS;i++){
    Sphere * sphere = parseSphere(i);
    spheres_h.push_back(sphere);
  }

  //initialize spheres_resh
  for(i=0;i<NUM_OBJECTS;i++){
    spheres_resh[i].a = 1;
    spheres_resh[i].b = 1;
    spheres_resh[i].c = 1;
  }

  thrust::device_vector<Sphere *> spheres_dv = spheres_h;
  thrust::device_vector<Sphere> spheres_resv = spheres_resh;
  Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
  Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);

  kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);

  thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
  thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());

  bool result = true;

  for(i=0;i<NUM_OBJECTS;i++){
    result &= (spheres_resh[i].a == i+1);
    result &= (spheres_resh[i].b == i+2);
    result &= (spheres_resh[i].c == i+3);
  }

  if(result)
  {
    cout << "Data GOOD!" << endl;
  }else{
    cout << "Data BAD!" << endl;
  }

  return 0;
}


--------------------------
        cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
  int index = threadIdx.x + blockIdx.x*blockDim.x;

  spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
  spheres_res[index].b = (*(spheres_d+index))->b; 
  spheres_res[index].c = (*(spheres_d+index))->c; 
}

void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{

 int threads = 512;//per block
 int grids = ((num_objects)/threads)+1;//blocks per grid

 deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}

score 4 · Answer 1 · edited May 23 '17 at 12:15

The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.

The solution is going to involve replacing your parseSphere function with something that allocates memory on the GPU; at present, parseSphere allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and were using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
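
As a rough illustration of the kernel-based approach, here is a minimal sketch. It assumes a device of compute capability 2.0 or later, and it assumes Sphere has three float members, since the question does not show its definition. The names parseSpheres, readSpheres, freeSpheres and buildOnDevice are made up for this example. Memory obtained with new inside a kernel lives on the device heap and has to be released with delete in device code, not with cudaFree.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Assumed layout: the question does not show the definition of Sphere.
struct Sphere {
    float a, b, c;
    __host__ __device__ Sphere() : a(0.0f), b(0.0f), c(0.0f) {}
};

// Device-side counterpart of parseSphere (hypothetical name): each thread
// allocates one Sphere on the device heap and records its address.
__global__ void parseSpheres(Sphere** spheres_d, int num_objects)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= num_objects) return;
    Sphere* s = new Sphere();   // device heap allocation; may return NULL
    if (s != NULL) {
        s->a = 1.0f + i;
        s->b = 2.0f + i;
        s->c = 3.0f + i;
    }
    spheres_d[i] = s;
}

// Dereferences the device pointers; legal because they point to device memory.
__global__ void readSpheres(Sphere** spheres_d, Sphere* spheres_res, int num_objects)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < num_objects && spheres_d[i] != NULL) {
        spheres_res[i] = *spheres_d[i];
    }
}

// Device heap memory must be released from device code.
__global__ void freeSpheres(Sphere** spheres_d, int num_objects)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < num_objects) {
        delete spheres_d[i];
        spheres_d[i] = NULL;
    }
}

// Returns true if every Sphere read back matches what parseSpheres wrote.
// num_objects must be at least 1.
bool buildOnDevice(int num_objects)
{
    thrust::device_vector<Sphere*> spheres_dv(num_objects);
    thrust::device_vector<Sphere> spheres_resv(num_objects);
    Sphere** spheres_d = thrust::raw_pointer_cast(spheres_dv.data());
    Sphere* spheres_res = thrust::raw_pointer_cast(spheres_resv.data());

    int threads = 128;
    int blocks = (num_objects + threads - 1) / threads;
    parseSpheres<<<blocks, threads>>>(spheres_d, num_objects);
    readSpheres<<<blocks, threads>>>(spheres_d, spheres_res, num_objects);
    freeSpheres<<<blocks, threads>>>(spheres_d, num_objects);
    if (cudaDeviceSynchronize() != cudaSuccess) return false;

    thrust::host_vector<Sphere> spheres_resh = spheres_resv;
    bool ok = true;
    for (int i = 0; i < num_objects; i++) {
        ok = ok && spheres_resh[i].a == 1.0f + i;
        ok = ok && spheres_resh[i].b == 2.0f + i;
        ok = ok && spheres_resh[i].c == 3.0f + i;
    }
    return ok;
}
```

Here the device_vector of pointers is only ever written by device code, so it never holds a host address.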

The alternative approach involves creating a host array holding device pointers, then copying that array to device memory. You can see an example of that in this answer. Note that declaring a thrust::device_vector containing thrust::device_vector probably won't work, so you will likely need to construct this array of device pointers using the underlying CUDA API calls.
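
A minimal sketch of the host-array-of-device-pointers approach, using plain CUDA runtime calls, might look like the following. The Sphere layout (three floats) is an assumption, and gatherSpheres and runPointerTable are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: the question does not show the definition of Sphere.
struct Sphere { float a, b, c; };

// Same job as deviceBegin in the question, with an added bounds check.
__global__ void gatherSpheres(Sphere** spheres_d, Sphere* spheres_res, int num_objects)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < num_objects) {
        spheres_res[index] = *spheres_d[index];
    }
}

// Returns true if the kernel read every Sphere correctly.
// num_objects must be at least 1.
bool runPointerTable(int num_objects)
{
    // Host array whose entries are DEVICE pointers, one cudaMalloc per object.
    std::vector<Sphere*> table_h(num_objects, (Sphere*)0);
    bool ok = true;
    for (int i = 0; i < num_objects; i++) {
        Sphere s = {1.0f + i, 2.0f + i, 3.0f + i};
        ok = ok && cudaMalloc(&table_h[i], sizeof(Sphere)) == cudaSuccess;
        ok = ok && cudaMemcpy(table_h[i], &s, sizeof(Sphere),
                              cudaMemcpyHostToDevice) == cudaSuccess;
    }

    // Copy the table itself to the device; its contents are already valid there.
    Sphere** table_d = 0;
    Sphere* res_d = 0;
    ok = ok && cudaMalloc(&table_d, num_objects * sizeof(Sphere*)) == cudaSuccess;
    ok = ok && cudaMalloc(&res_d, num_objects * sizeof(Sphere)) == cudaSuccess;
    ok = ok && cudaMemcpy(table_d, table_h.data(), num_objects * sizeof(Sphere*),
                          cudaMemcpyHostToDevice) == cudaSuccess;

    std::vector<Sphere> res_h(num_objects);
    if (ok) {
        int threads = 128;
        int blocks = (num_objects + threads - 1) / threads;
        gatherSpheres<<<blocks, threads>>>(table_d, res_d, num_objects);
        // This blocking cudaMemcpy also surfaces any kernel execution error.
        ok = cudaMemcpy(res_h.data(), res_d, num_objects * sizeof(Sphere),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; i < num_objects; i++) {
        ok = ok && res_h[i].a == 1.0f + i;
        ok = ok && res_h[i].b == 2.0f + i;
        ok = ok && res_h[i].c == 3.0f + i;
    }

    // Release everything; cudaFree on a null pointer is a no-op.
    for (int i = 0; i < num_objects; i++) cudaFree(table_h[i]);
    cudaFree(table_d);
    cudaFree(res_d);
    return ok;
}
```

One cudaMalloc and one cudaMemcpy per object is slow for large counts. Allocating a single device buffer for all spheres and storing pointers into it would cut that overhead, at which point the pointer table is mostly redundant.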

You should also note that I haven't mentioned the reverse copy operation, which is equally difficult to do.
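
To give an idea of what the reverse copy involves, here is a small sketch that assumes you kept a host-side array of the device pointers. Each pointed-to object needs its own device-to-host transfer; copying only the pointer array back would just give you device addresses on the host. The Sphere layout and the names copySpheresToHost and roundTrip are assumptions made for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: the question does not show the definition of Sphere.
struct Sphere { float a, b, c; };

// table_h is a HOST array whose entries are DEVICE pointers.
// Copies each pointed-to Sphere back to the host, one transfer per object.
bool copySpheresToHost(const std::vector<Sphere*>& table_h,
                       std::vector<Sphere>& spheres_out)
{
    spheres_out.resize(table_h.size());
    for (size_t i = 0; i < table_h.size(); i++) {
        if (cudaMemcpy(&spheres_out[i], table_h[i], sizeof(Sphere),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            return false;
        }
    }
    return true;
}

// Small driver: uploads n spheres one by one, then copies them back and checks them.
bool roundTrip(int n)
{
    std::vector<Sphere*> table_h(n, (Sphere*)0);
    bool ok = true;
    for (int i = 0; i < n; i++) {
        Sphere s = {1.0f + i, 2.0f + i, 3.0f + i};
        ok = ok && cudaMalloc(&table_h[i], sizeof(Sphere)) == cudaSuccess;
        ok = ok && cudaMemcpy(table_h[i], &s, sizeof(Sphere),
                              cudaMemcpyHostToDevice) == cudaSuccess;
    }

    std::vector<Sphere> back;
    ok = ok && copySpheresToHost(table_h, back);
    for (int i = 0; i < n; i++) {
        // Short-circuiting on ok avoids touching back[] after a failure.
        ok = ok && back[i].a == 1.0f + i;
        ok = ok && back[i].b == 2.0f + i;
        ok = ok && back[i].c == 3.0f + i;
    }

    for (int i = 0; i < n; i++) cudaFree(table_h[i]);  // null entries are ignored
    return ok;
}
```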

The bottom line is that Thrust (and C++ STL containers for that matter) really are not intended to hold pointers. They are intended to hold values, and abstract away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Further, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it executes slower on the GPU as well.
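
For comparison, a value-based version of the same test could look roughly like this. It is only a sketch: the Sphere layout is assumed, and CopySphere and runByValue are names invented for this example. Because the vectors hold Sphere values and no pointers, Thrust handles every host/device transfer itself.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

// Assumed layout: the question does not show the definition of Sphere.
struct Sphere { float a, b, c; };

// Trivial functor standing in for whatever per-sphere work the kernel would do.
struct CopySphere {
    __host__ __device__ Sphere operator()(const Sphere& s) const { return s; }
};

// Returns true if the spheres survive the host -> device -> host round trip.
bool runByValue(int num_objects)
{
    thrust::host_vector<Sphere> spheres_h(num_objects);
    for (int i = 0; i < num_objects; i++) {
        spheres_h[i].a = 1.0f + i;
        spheres_h[i].b = 2.0f + i;
        spheres_h[i].c = 3.0f + i;
    }

    // Assignment copies the Sphere values themselves to the GPU.
    thrust::device_vector<Sphere> spheres_dv = spheres_h;
    thrust::device_vector<Sphere> spheres_resv(num_objects);

    // Runs on the device; no raw pointers or explicit kernel launch needed.
    thrust::transform(spheres_dv.begin(), spheres_dv.end(),
                      spheres_resv.begin(), CopySphere());

    thrust::host_vector<Sphere> spheres_resh = spheres_resv;
    bool ok = true;
    for (int i = 0; i < num_objects; i++) {
        ok = ok && spheres_resh[i].a == 1.0f + i;
        ok = ok && spheres_resh[i].b == 2.0f + i;
        ok = ok && spheres_resh[i].c == 3.0f + i;
    }
    return ok;
}
```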

Awesome, thanks for the great response...I'll try some of these ideas out and get back with the results! — nhelenih, Jun 06 '11 at 17:01
