
Disclaimer: I'm not ENTIRELY lost here, but I just need some guidance. I'm working with an image that is stored pixel by pixel within a 2D array. The array is a data member of an Image class. This program works perfectly as a serial program. Anyhow...

class Image{
    int rows;
    int cols;
    int ** pixels; //2D array
};

The pixels are stored in this format: pixels[rows][cols]

I know I can't access host data members within __global__ CUDA functions, which is where I'm stuck. I need to:

1) Access the data member (pixels)
2) Copy everything to CUDA
3) Do work on it
4) Get it all back
5) Store it back into pixels

So my question is, how do I copy and use that data within my CUDA function?

Here that is:

__global__ void cuda_negate_image(int ** new_array, int ** old_array, int rows, int cols){

    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < rows && j < cols) {
        new_array[i][j] = -(old_array[i][j]) + 255;
    }
}

I know how to work with pointers, but not pointers to pointers :(.

Mmm Donuts
  • You first need to copy your image to a device pointer. Look for cudaMalloc and cudaMemCpy. You pass the device pointer to the kernel. – Joan Charmant Nov 14 '15 at 09:11
  • The answer given by @talonmies [here](http://stackoverflow.com/questions/6137218/how-can-i-add-up-two-2d-pitched-arrays-using-nested-for-loops) is the canonical answer, in my opinion, to how to access a double-pointer array in a CUDA kernel. I would suggest that your question appears to be basically a duplicate of that one, in that the answer there should essentially answer your question. It's probably considerably easier if you flatten the "2D" (doubly-subscripted) array to a single-pointer array, and access it with index arithmetic in your kernel such as `new_array[i*cols+j]` – Robert Crovella Nov 15 '15 at 03:27

1 Answer


As Robert pointed out in comments, this is a pretty common question which comes up regularly, and my rather old answer highlights most of the important points, although it probably isn't the canonical example we should have.

The really short answer is that you need to build the array of device pointers in host memory first, and then copy that array to the device. Turning your code into a trivial example that allocates memory on the device gets you something like this:

class Image{
    public:

    int rows;
    int cols;
    int ** pixels; //2D array

    __host__ __device__
    Image() {};
    __host__ __device__
    Image(int r, int c, int** p) : rows(r), cols(c), pixels(p) {};
};

__global__ void intialiseImage(Image image, const int p_val)
{
    int i = blockIdx.y*blockDim.y + threadIdx.y;
    int j = blockIdx.x*blockDim.x + threadIdx.x;

    if (i < image.rows && j < image.cols) {
        image.pixels[i][j] = p_val;
    }
}

int** makeDeviceImage(Image& dev_image, const int rows, const int cols)
{
    int** h_pixels = new int*[rows];
    for(int i=0; i<rows; i++) {
        cudaMalloc((void **)&h_pixels[i], sizeof(int) * size_t(cols));
    }
    int** d_pixels;
    cudaMalloc((void**)&d_pixels, sizeof(int*) * size_t(rows));
    cudaMemcpy(d_pixels, &h_pixels[0], sizeof(int*) * size_t(rows), cudaMemcpyHostToDevice);

    dev_image = Image(rows, cols, d_pixels);

    return h_pixels;
}


int main(void)
{
    int rows = 16, cols = 32;

    Image dev_image;
    int** dev_pixels = makeDeviceImage(dev_image, rows, cols);

    // Launch one thread per pixel: the kernel indexes columns with .x and rows with .y
    dim3 threads(cols, rows);
    intialiseImage<<<1, threads>>>(dev_image, 128);
    cudaDeviceSynchronize();
    cudaDeviceReset();

    return 0;
}

I will leave the copy code as an exercise for the reader (hint: the array of pointers the function returns is extremely useful there), but there is one comment worth making. Have a look at this profiler output for that code:

>nvprof a.exe
==5148== NVPROF is profiling process 5148, command: a.exe
==5148== Profiling application: a.exe
==5148== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 75.82%  2.2070us         1  2.2070us  2.2070us  2.2070us  intialiseImage(Image, int)
 24.18%     704ns         1     704ns     704ns     704ns  [CUDA memcpy HtoD]

==5148== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.33%  309.01ms        17  18.177ms  20.099us  308.62ms  cudaMalloc
  0.50%  1.5438ms        83  18.599us     427ns  732.97us  cuDeviceGetAttribute
  0.07%  202.70us         1  202.70us  202.70us  202.70us  cuDeviceGetName
  0.04%  136.84us         1  136.84us  136.84us  136.84us  cudaDeviceSynchronize
  0.03%  92.370us         1  92.370us  92.370us  92.370us  cudaMemcpy
  0.02%  76.974us         1  76.974us  76.974us  76.974us  cudaLaunch
  0.01%  24.375us         1  24.375us  24.375us  24.375us  cuDeviceTotalMem
  0.00%  5.5580us         2  2.7790us  2.5650us  2.9930us  cuDeviceGetCount
  0.00%  4.2760us         1  4.2760us  4.2760us  4.2760us  cudaConfigureCall
  0.00%  3.4220us         2  1.7110us     856ns  2.5660us  cudaSetupArgument
  0.00%  2.5660us         2  1.2830us  1.2830us  1.2830us  cuDeviceGet

On the platform I tested (Windows 8, mobile-class Fermi GPU), the kernel that writes a value into the image takes about 2us. A cudaMalloc call takes at least 20us, and there are 17 of them just to allocate this trivially small array. The overhead of working with arrays of pointers in CUDA is significant, and I wouldn't recommend it if performance is your first priority.

talonmies