As Robert pointed out in comments, this is a pretty common question which comes up regularly, and my rather old answer highlights most of the important points, although it probably isn't the canonical example we should have.
The really short answer is that you need to build the array of device pointers in host memory first, and then copy that array to the device. Turning your code into a trivial example which allocates memory on the device gets you something like this:
class Image {
public:
    int rows;
    int cols;
    int **pixels; // 2D array

    __host__ __device__
    Image() {};

    __host__ __device__
    Image(int r, int c, int **p) : rows(r), cols(c), pixels(p) {};
};
__global__ void intialiseImage(Image image, const int p_val)
{
    // Each thread writes one pixel, using a 2D thread index
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < image.rows && j < image.cols) {
        image.pixels[i][j] = p_val;
    }
}
int **makeDeviceImage(Image &dev_image, const int rows, const int cols)
{
    // Allocate each row on the device, collecting the device pointers
    // in a host-side array first (error checking omitted for brevity)
    int **h_pixels = new int*[rows];
    for (int i = 0; i < rows; i++) {
        cudaMalloc((void **)&h_pixels[i], sizeof(int) * size_t(cols));
    }

    // Copy the host array of device row pointers to the device
    int **d_pixels;
    cudaMalloc((void **)&d_pixels, sizeof(int*) * size_t(rows));
    cudaMemcpy(d_pixels, &h_pixels[0], sizeof(int*) * size_t(rows), cudaMemcpyHostToDevice);

    dev_image = Image(rows, cols, d_pixels);

    // Keep the host copy of the row pointers -- it is needed for any
    // later copies to or from the individual rows
    return h_pixels;
}
int main(void)
{
    int rows = 16, cols = 32;
    Image dev_image;
    int **dev_pixels = makeDeviceImage(dev_image, rows, cols);

    // Use a 2D launch so the kernel's 2D thread indexing covers every
    // row, not just the first one
    dim3 blocksize(16, 16);
    dim3 gridsize((cols + blocksize.x - 1) / blocksize.x,
                  (rows + blocksize.y - 1) / blocksize.y);
    intialiseImage<<<gridsize, blocksize>>>(dev_image, 128);
    cudaDeviceSynchronize();

    cudaDeviceReset();
    return 0;
}
I will leave the full copy code as an exercise for the reader (hint: the array of pointers the function returns is extremely useful there; see the sketch at the end of this answer), but there is one comment worth making. Have a look at this profiler output for that code:
>nvprof a.exe
==5148== NVPROF is profiling process 5148, command: a.exe
==5148== Profiling application: a.exe
==5148== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 75.82%  2.2070us         1  2.2070us  2.2070us  2.2070us  intialiseImage(Image, int)
 24.18%     704ns         1     704ns     704ns     704ns  [CUDA memcpy HtoD]

==5148== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.33%  309.01ms        17  18.177ms  20.099us  308.62ms  cudaMalloc
  0.50%  1.5438ms        83  18.599us     427ns  732.97us  cuDeviceGetAttribute
  0.07%  202.70us         1  202.70us  202.70us  202.70us  cuDeviceGetName
  0.04%  136.84us         1  136.84us  136.84us  136.84us  cudaDeviceSynchronize
  0.03%  92.370us         1  92.370us  92.370us  92.370us  cudaMemcpy
  0.02%  76.974us         1  76.974us  76.974us  76.974us  cudaLaunch
  0.01%  24.375us         1  24.375us  24.375us  24.375us  cuDeviceTotalMem
  0.00%  5.5580us         2  2.7790us  2.5650us  2.9930us  cuDeviceGetCount
  0.00%  4.2760us         1  4.2760us  4.2760us  4.2760us  cudaConfigureCall
  0.00%  3.4220us         2  1.7110us     856ns  2.5660us  cudaSetupArgument
  0.00%  2.5660us         2  1.2830us  1.2830us  1.2830us  cuDeviceGet
On the platform I tested (Windows 8, a mobile class Fermi GPU), the kernel which writes a value into the image takes about 2us, whereas a single cudaMalloc call takes at least 20us, and 17 of them are required to allocate even this trivially small image. The overhead of working with arrays of pointers in CUDA is significant, and I wouldn't recommend it if performance is your first priority.
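For completeness, here is a sketch of the copy code left as an exercise above. Treat it as one possible way rather than the definitive one: the function name copyImageToHost and the flat host buffer h_image are my own inventions for illustration, and error checking is again omitted. The key point is that the host-side copy of the row pointers returned by makeDeviceImage is what lets the host address each device row:

// Sketch only: copy the initialised image back to a host buffer,
// one row at a time. h_image is assumed to be a host allocation of
// at least rows * cols ints; h_pixels is the array returned by
// makeDeviceImage and holds the *device* address of each row
void copyImageToHost(int *h_image, int **h_pixels, const int rows, const int cols)
{
    for (int i = 0; i < rows; i++) {
        cudaMemcpy(&h_image[i * cols], h_pixels[i],
                   sizeof(int) * size_t(cols), cudaMemcpyDeviceToHost);
    }
}

Note that this costs one cudaMemcpy call per row, so the API overhead scales with the number of rows, just as it did during allocation.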
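If performance does matter, the usual alternative is to drop the array of pointers entirely and make a single flat allocation with manual row-major indexing. The FlatImage class below is only a sketch of that idea under the same assumptions as the code above, not code from the question:

class FlatImage {
public:
    int rows;
    int cols;
    int *pixels; // single allocation, row-major storage

    __host__ __device__
    FlatImage() {};
    __host__ __device__
    FlatImage(int r, int c, int *p) : rows(r), cols(c), pixels(p) {};

    // Manual 2D indexing into the flat allocation
    __device__
    int &at(int i, int j) { return pixels[i * cols + j]; };
};

FlatImage makeFlatDeviceImage(const int rows, const int cols)
{
    // One cudaMalloc replaces the rows+1 calls needed above, and no
    // host-to-device copy of row pointers is required at all
    int *pixels;
    cudaMalloc((void **)&pixels, sizeof(int) * size_t(rows) * size_t(cols));
    return FlatImage(rows, cols, pixels);
}

A kernel then writes image.at(i, j) = p_val; instead of image.pixels[i][j] = p_val;, and the whole image can be moved in either direction with a single cudaMemcpy.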