
I want to create an object on the device and assign it to a pointer available on the host. Is there something I'm doing wrong in here?

__global__ void createAProduction(DeviceProduction* production) {
    production = new AProduction();
}

DeviceProduction * devAProduction = NULL;
cudaMalloc(&devAProduction, sizeof(AProduction));
createAProduction<<<1, 1>>>(devAProduction);
deviceProductions["A"] = devAProduction;

Somewhere further in the code I'd like to do something like:

BatchOperation ** devBatchOperations;
    cudaMalloc((void **) &devBatchOperations, sizeof(BatchOperation *) * operationCount);

Then I populate that pointer array like this:

void DeviceBatchExecutor::execute(vector<BatchOperation> operationsToPerform) {
    BatchOperation ** devBatchOperations;
    cudaMalloc((void **) &devBatchOperations, sizeof(BatchOperation *) * operationsToPerform.size());
    int i = 0;
    for(batchOperationIt it = operationsToPerform.begin(); it != operationsToPerform.end(); ++it) {
        BatchOperation * devBatchOperation;
        cudaMalloc(&devBatchOperation, sizeof(BatchOperation));
        cudaMemcpy(&devBatchOperation, &it, sizeof(BatchOperation), cudaMemcpyHostToDevice);
        Vertex * devInputNode = it->inputNode->allocateToDevice();
        cudaMemcpy(&(devBatchOperation->inputNode), &devInputNode, sizeof(Vertex *), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&(devBatchOperation->production), &(it->production), sizeof(Production *), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&devBatchOperations[i], &devBatchOperation, sizeof(BatchOperation *), cudaMemcpyDeviceToDevice);
        i++;
    }
    int operationCount = operationsToPerform.size();
    executeOperations<<<operationCount, 1>>>(devBatchOperations);
}

where production is a pointer to the device memory holding the AProduction object created above. Finally, I invoke processing via

executeOperations<<<operationCount, 1>>>(devBatchOperations);

So I'm relying on virtual method calls. Since those DeviceProduction objects were created on the device, their virtual function table is valid there, so it should work. See example here. But it doesn't: the batch operations received by the kernel seem random, and it crashes on invocation.

__global__ void executeOperations(BatchOperation ** operation) {    
    operation[blockIdx.x]->production->apply(operation[blockIdx.x]->inputNode);
}

BatchOperation is a struct holding the production to be executed.

struct BatchOperation {
    Production * production;
    Vertex * inputNode;
    Vertex * outputNode;
};
kboom

2 Answers


Is there something I'm doing wrong in here?

Yes, probably. The pointer production is passed to the kernel by value:

createAProduction<<<1, 1>>>(devAProduction);

It points to a location in device memory somewhere, since you've already run cudaMalloc on it. This line of kernel code:

production = new AProduction();

overwrites the pass-by-value copy of the production pointer with a new one, returned by in-kernel new. That is almost certainly not what you intended. (And you haven't defined what AProduction is.) When that kernel call completes, the pass-by-value "copy" of the pointer is lost anyway. You might be able to fix it like this:

*production = *(new DeviceProduction());

Now your production pointer points to a region in device memory that holds a copy of an object instantiated on the device, which appears to be your intent there. Creating a new object just to copy it may not be necessary (and the object returned by new is never freed), but that is not the crux of the issue I'm trying to point out here. Be aware, too, that copy assignment copies only the data members and not the hidden virtual table pointer, so this alone is not sufficient for objects whose virtual methods you intend to call. You can probably also "fix" this issue by passing a pointer-to-pointer to the kernel instead. You would then need to allocate for an array of pointers, and assign one of the individual pointers using the in-kernel new directly, as you have shown.

The remainder of your code has a great many items undefined. For example in the above code it's not clear why you would declare that production is a pointer to a DeviceProduction type, but then try to allocate an AProduction type to it. Presumably that is some form of object inheritance which is unclear.

Since you haven't really provided anything approaching a complete code, I've borrowed some pieces from here to put together a complete worked example, showing object creation/setup in one kernel, followed by another kernel that invokes virtual methods on those objects:

$ cat t1086.cu
#include <stdio.h>
#define N 4


class Polygon {
  protected:
    int width, height;
  public:
  __host__ __device__  void set_values (int a, int b)
      { width=a; height=b; }
  __host__ __device__  virtual int area ()
      { return 0; }
};

class Rectangle: public Polygon {
  public:
  __host__ __device__  int area ()
      { return width * height; }
};

class Triangle: public Polygon {
  public:
  __host__ __device__   int area ()
      { return (width * height / 2); }
};

__global__ void setup_f(Polygon ** d_polys) {
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < N) {
    if (idx%2)
      d_polys[idx] = new Rectangle();
    else
      d_polys[idx] = new Triangle();
    d_polys[idx]->set_values(5,12);
}};

__global__ void area_f(Polygon ** d_polys) {
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < N){
    printf("area of object %d = %d\n", idx, d_polys[idx]->area());
}};


int main () {

  Polygon **devPolys;
  cudaMalloc(&devPolys,N*sizeof(Polygon *));
  setup_f<<<1,N>>>(devPolys);
  area_f<<<1,N>>>(devPolys);
  cudaDeviceSynchronize();
}
$ nvcc -o t1086 t1086.cu
$ cuda-memcheck ./t1086
========= CUDA-MEMCHECK
area of object 0 = 30
area of object 1 = 60
area of object 2 = 30
area of object 3 = 60
========= ERROR SUMMARY: 0 errors
$
Robert Crovella
  • Great! This part of a code seems fine now. Could you have a look at **executeOperations** function call? It's where I pass the function to be executed to the device (Production pointer in BatchOperation structure). Before that I must copy the production pointer to the other structure allocated on the device. Am I doing it correctly? – kboom Feb 24 '16 at 14:42
  • No, there are certainly errors in your `cudaMemcpy` operations. You're not doing any cuda error checking, are you? Is there a reason you are not using rigorous error checking? If you are looking for debugging help, you are supposed to provide a complete code. Take a look at my answer for an example of a complete code that demonstrates one concept. Don't provide pages and pages of your own code. Simplify it to just show one operation, such as one invocation of a virtual function that is not working. But it must be a complete code that someone else could compile and run. – Robert Crovella Feb 24 '16 at 14:48
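For reference, here is one pattern that avoids the kind of `cudaMemcpy` errors mentioned in the comment above. In the question's loop, `&devBatchOperation` and `&it` are host addresses (of a pointer variable and of an iterator), and `cudaMemcpyDeviceToDevice` is used with host-side sources. A minimal sketch of an alternative: fill in each struct on the host with device addresses, copy it with `cudaMemcpyHostToDevice`, then copy the whole array of device addresses in one call. The `Op` struct, `runOps` kernel and `stageAndRun` function are hypothetical stand-ins, not the question's types, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical stand-in for BatchOperation: one device pointer plus a factor.
struct Op {
    int *input;  // device address
    int scale;
};

// One block per operation, mirroring the question's launch configuration.
__global__ void runOps(Op **ops, int n) {
    int idx = blockIdx.x;
    if (idx < n) *(ops[idx]->input) *= ops[idx]->scale;
}

// Assumes n > 0. Returns the values read back from the device.
std::vector<int> stageAndRun(int n) {
    std::vector<int *> devInputs(n);
    std::vector<Op *> devOpPtrs(n);  // host array whose elements are device addresses
    for (int i = 0; i < n; ++i) {
        int value = i + 1;
        cudaMalloc(&devInputs[i], sizeof(int));
        cudaMemcpy(devInputs[i], &value, sizeof(int), cudaMemcpyHostToDevice);

        // Build the struct on the host, with its pointer member already
        // holding a device address, then copy the struct itself.
        Op staged = { devInputs[i], 10 };
        cudaMalloc(&devOpPtrs[i], sizeof(Op));
        cudaMemcpy(devOpPtrs[i], &staged, sizeof(Op), cudaMemcpyHostToDevice);
    }

    // The pointer values live in host memory, so this is host-to-device.
    Op **devOps = NULL;
    cudaMalloc(&devOps, n * sizeof(Op *));
    cudaMemcpy(devOps, &devOpPtrs[0], n * sizeof(Op *), cudaMemcpyHostToDevice);

    runOps<<<n, 1>>>(devOps, n);
    cudaDeviceSynchronize();

    std::vector<int> results(n);
    for (int i = 0; i < n; ++i) {
        cudaMemcpy(&results[i], devInputs[i], sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(devInputs[i]);
        cudaFree(devOpPtrs[i]);
    }
    cudaFree(devOps);
    return results;
}

int main() {
    std::vector<int> r = stageAndRun(3);
    for (size_t i = 0; i < r.size(); ++i) printf("op %d -> %d\n", (int)i, r[i]);
    return 0;
}
```

The general rule is that the direction flag describes where the bytes being copied live, not what they mean: a device address stored in a host variable is still host data.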

Robert's suggestion seems to have made it work:

__global__ void createAProduction(DeviceProduction** production) {
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if(idx == 0) {
        production[0] = new AProduction();
    }   
}

Called like this:

DeviceProduction ** devAProduction = NULL;
cudaMalloc(&devAProduction, sizeof(AProduction *));
createAProduction<<<1, 1>>>(devAProduction);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
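`gpuErrchk` is not defined anywhere in this post. It is presumably the widely used error-checking macro; a minimal sketch under that assumption follows. The `gpuAssert` helper name and its `bool` return value are illustrative choices, and `noop` is a hypothetical kernel used only to have something to check.

```cuda
#include <cstdio>
#include <cstdlib>

// Reports a CUDA error with file and line. Returns true on success.
// When abort is true (the default), exits on any error.
inline bool gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
        return false;
    }
    return true;
}

// Wraps any runtime API call that returns cudaError_t.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

// Hypothetical empty kernel, only here so there is a launch to check.
__global__ void noop() {}

int main() {
    noop<<<1, 1>>>();
    gpuErrchk( cudaPeekAtLastError() );    // launch configuration errors
    gpuErrchk( cudaDeviceSynchronize() );  // errors raised while the kernel ran
    printf("no CUDA errors reported\n");
    return 0;
}
```

Checking both after a launch separates errors in the launch itself from errors that occur during kernel execution.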

But if I want to keep a single-pointer structure for the deviceProductions array, would it be OK to do something like this?

deviceProductions["A"] = (DeviceProduction *) malloc(sizeof(AProduction *));
gpuErrchk(cudaMemcpy(deviceProductions["A"], devAProduction, sizeof(AProduction *), cudaMemcpyDeviceToHost));

My intention was to copy the pointer (address) to the host memory from the device memory. Am I doing it right?
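For what it's worth, here is a minimal sketch of copying just the address back to the host. It assumes the host-side map stores plain `DeviceProduction *` values; `Base`, `Derived`, `create`, `use` and `runDemo` are hypothetical stand-ins, and error checking is omitted. The destination of the copy is the address of a host pointer variable, so no host `malloc` is needed. The resulting value is only meaningful on the device: it can be passed to later kernels, but as far as I know memory obtained from in-kernel `new` cannot be used as the source or destination of `cudaMemcpy`, nor passed to `cudaFree`.

```cuda
#include <cstdio>

// Hypothetical stand-ins for DeviceProduction / AProduction.
struct Base {
    __device__ virtual int apply(int x) { return x; }
};
struct Derived : Base {
    __device__ int apply(int x) { return 2 * x; }
};

// Stores the address returned by in-kernel new into slot[0].
__global__ void create(Base **slot) {
    if (threadIdx.x == 0 && blockIdx.x == 0) slot[0] = new Derived();
}

// Receives the object pointer by value and makes a virtual call on the device.
__global__ void use(Base *obj, int *out) {
    *out = obj->apply(21);
}

int runDemo() {
    Base **devSlot = NULL;
    cudaMalloc(&devSlot, sizeof(Base *));
    create<<<1, 1>>>(devSlot);
    cudaDeviceSynchronize();

    // Copy the pointer VALUE from the device slot into a host variable.
    // devObj must never be dereferenced on the host.
    Base *devObj = NULL;
    cudaMemcpy(&devObj, devSlot, sizeof(Base *), cudaMemcpyDeviceToHost);

    int *devOut = NULL;
    int hostOut = 0;
    cudaMalloc(&devOut, sizeof(int));
    use<<<1, 1>>>(devObj, devOut);
    cudaMemcpy(&hostOut, devOut, sizeof(int), cudaMemcpyDeviceToHost);

    // The Derived object stays on the device heap until a kernel deletes it.
    cudaFree(devOut);
    cudaFree(devSlot);
    return hostOut;
}

int main() {
    printf("result = %d\n", runDemo());
    return 0;
}
```

With a map like the one in the question, the equivalent would be to copy into a local `DeviceProduction *` first and then assign that to `deviceProductions["A"]`.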

kboom