cuda: allocating array of pointers to images and the images

Question

I'm new to CUDA. I want to copy stack of images to the device.

unsigned char** _devStackImagesCuda = NULL;
int stackSize = 5;//should be replaced by argument to the function
if(_devStackImagesCuda == NULL)\\allocate array of pointers on the device
{
    cudaMalloc(&_devStackImagesCuda,  sizeof(unsigned char*)  * stackSize);
    cudaMemset(_devStackImagesCuda, 0, sizeof(unsigned char*) * stackSize);
}

for(int i = 0; i < stackSize; i++)
{

    if(_devStackImagesCuda[i] == NULL) //allocates one image on the device.
        cudaMalloc(&_devStackImagesCuda[i], imageSize * sizeof(unsigned char));
    cudaMemcpy(_devStackImagesCuda[i], _imageStack->GetImage(i, dummy, true), imageSize, cudaMemcpyHostToDevice);//copy image data to device
}

Is it OK?

So this is a debugging question? Perhaps you could change the title to reflect this? Are you sure you haven't just edited the source file for the program without recompiling it? — talonmies, Feb 27 '13 at 14:12
How about some [error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on your cuda calls? For an arrangement of pointers to pointers to be accessible on the device, you cannot cudaMalloc the top level pointer and then cudaMalloc pointers beneath it. It's necessary to create a pointer arrangement on the host and then copy it to the device. See [this question/answer](http://stackoverflow.com/questions/14790999/how-to-pass-a-c-class-with-array-of-pointers-to-cuda/14791979#14791979) for an example. — Robert Crovella, Feb 27 '13 at 14:13
Also, this line: `if(_devStackImagesCuda[i] == NULL)` is dereferencing a device pointer in host code. This is illegal and won't work (segfault). So this method of trying to determine whether or not a device pointer needs to be cudaMalloc'ed by testing if the device pointer is zero (on the host) is a bad idea. If you really wanted to, you could cudaMemcpy the pointer back to the host before testing if it is zero, but I would look for a better way to get my allocations done. — Robert Crovella, Feb 27 '13 at 14:38

score 2 · Accepted Answer · edited May 23 '17 at 12:12

As indicated in the comments there are several problems with your approach.

As a beginner especially, you should always do error checking on your cuda calls (including kernel calls). In the code I have below is an example, or refer to this question/answer
Creating a pointer-to-pointer arrangement in cuda is sometimes not intuitive, because the approach of cudaMalloc'ing the top level pointer and then cudaMalloc'ing the pointers underneath it will not work. This is because to cudaMalloc the pointers underneath it, we must pass the top level pointer to cudaMalloc, but this is already a device pointer. cudaMalloc expects you to pass a host pointer that it will then cudaMalloc to be on the device. So to address this it's usually necessary to create a shadow or parallel pointer arrangement on the host, and pass all these to cudaMalloc in succession, and then copy these pointers to the device. See my code below for an example.
You also wanted to test for the validity of a device pointer on the host to see if you needed to cudaMalloc it. This won't work as it leads to dereferencing a device pointer on the host. Specifically at this line: if(_devStackImagesCuda[i] == NULL), you are trying to see if _devStackImagesCuda[i] is valid, but in order to do this you must dereference _devStackImagesCuda however you have previously done a cudaMalloc on this pointer (to a pointer) and so it is now a device pointer, which you are not allowed to dereference on the host. I suggest you keep track of whether you need to cudaMalloc these pointers some other way.

I believe something like this will work:

#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


int main(){

  unsigned char ** _devStackImagesCuda=0;

  int stackSize = 5;
  int imageSize = 4;
  unsigned char *temp[stackSize];
  unsigned char dummy_image[imageSize];
// first create top level pointer
  if ( _devStackImagesCuda == 0) //allocate array of pointers on the device
    {
    cudaMalloc(&_devStackImagesCuda,  sizeof(unsigned char*)  * stackSize);
    cudaCheckErrors("cm 1");
    }
// then create child pointers on host, and copy to device, then copy image
  for(int i = 0; i < stackSize; i++)
    {

    cudaMalloc(&temp[i], imageSize * sizeof(unsigned char));
    cudaCheckErrors("cm 2");
    cudaMemcpy(&(_devStackImagesCuda[i]), &(temp[i]), sizeof(unsigned char *), cudaMemcpyHostToDevice);//copy child pointer to device
    cudaCheckErrors("cudamemcopy1");
    cudaMemcpy(temp[i], dummy_image, imageSize*sizeof(unsigned char), cudaMemcpyHostToDevice); // copy image to device
    cudaCheckErrors("cudamemcpy2");

    }


return 0;
}

By the way, you could simplify things quite a bit if you can treat your array of images as a contiguous region. Like so:

unsigned char images[NUM_IMAGES*IMAGE_SIZE]; // or you could malloc this
unsigned char *d_images;
cudaMalloc((void **) d_images, NUM_IMAGES*IMAGE_SIZE*sizeof(unsigned char));
cudaMemcpy(d_images, images, NUM_IMAGES*IMAGE_SIZE*sizeof(unsigned char), cudaMemcpyHostToDevice);

and access individual image elements by:

unsigned char mypixel = images[i + (IMAGE_SIZE * j)]; // to access element i in image j

Thanks a lot !! As you suggested it is not intuitive to me.. Is it possible to use your example even with unsigned char** temp or should it be unsigned char *temp[MAX_SIZE_FOR_STACK_SIZE] ? — idoo, Feb 27 '13 at 15:21
I think unsigned char ** temp should work also. In that sense, the two representations of temp are the same, aren't they (both are pointers to pointers to unsigned char)? Why don't you try whichever you prefer and see if you can make it work. — Robert Crovella, Feb 27 '13 at 15:27
Thanks a lot again ! just to check i understand the logic: for the unsigned char** temp version: 1. i should malloc (not cudaMalloc): temp = new unsigned char*[stackSize]. 2. loop for(i=0;i — idoo, Feb 27 '13 at 15:34
yes to make **temp equivalent to temp[stackSize] you would need to malloc with **temp. The rest should be the same. — Robert Crovella, Feb 27 '13 at 15:37

cuda: allocating array of pointers to images and the images

1 Answers1

Linked