CUDA kernel and 2D arrays - how does it work?

Question

I am writing an image rotation method. It takes two matrices and a rotation angle in degrees, rotates the original matrix by that angle, and stores the result in the rotated matrix. I have the following "normal" CPU code (taken from http://sinepost.wordpress.com/2012/07/24/image-rotation/), and it works as it should:

static void RotateImage(unsigned char original[RAW_HEIGHT][RAW_WIDTH] , unsigned char rotated[RAW_HEIGHT][RAW_WIDTH] , int degrees)
{
    double centerX = RAW_WIDTH/2;
    double centerY = RAW_HEIGHT/2;

    for(int x = 0; x< RAW_HEIGHT;x++)
    {
        for (int y = 0; y < RAW_WIDTH; y++)
        {
            double dir = calculateDirection(x-centerX,y-centerY);
            double mag = calculateMagnitude(x-centerX,y-centerY);

            dir-=degrees;

            int origX = (int)(centerX + calculateX(dir,mag));
            int origY = (int)(centerY + calculateY(dir,mag));

            if (origX >= 0 && origX < RAW_HEIGHT && origY >= 0 && origY < RAW_WIDTH)
            {
                    rotated[x][y] = original[origX][origY];
            }
        }
    }
}
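
The helpers calculateDirection, calculateMagnitude, calculateX and calculateY are not shown in the question. Judging only from the expressions inlined in the CUDA attempt further down, they might look roughly like the sketch below. The bodies are an assumption (angles in degrees, measured from the +x axis), not the code from the linked post:

```cpp
#include <cmath>

// Assumption: angles are in degrees, as the `dir -= degrees` step suggests.
const double kPi = 3.14159265358979323846;

// Angle of the vector (x, y), measured from the +x axis, in degrees.
inline double calculateDirection(double x, double y) {
    return std::atan2(y, x) * 180.0 / kPi;
}

// Length of the vector (x, y).
inline double calculateMagnitude(double x, double y) {
    return std::sqrt(x * x + y * y);
}

// X component of a vector given in polar form (degrees, length).
inline double calculateX(double directionDegrees, double magnitude) {
    return std::cos(directionDegrees * kPi / 180.0) * magnitude;
}

// Y component of a vector given in polar form (degrees, length).
inline double calculateY(double directionDegrees, double magnitude) {
    return std::sin(directionDegrees * kPi / 180.0) * magnitude;
}
```

Converting to polar form, subtracting the angle, and converting back is equivalent to applying a 2D rotation matrix to (x - centerX, y - centerY).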

I would like to port this code to CUDA. Here is my version:

#define RAW_WIDTH (1600*3)
#define RAW_HEIGHT 1200

unsigned char *dev_original_image;
unsigned char *dev_rotated_image;

__global__ void rotatePicture(unsigned char *original, unsigned char *rotated, int degrees)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int offset_rotated = x + y * blockDim.x * gridDim.x;

    double centerX = 2400.0;
    double centerY = 600.0;

    double dir = (atan2(y-centerY,x-centerX))*180/3.14159265;
    double mag = sqrt((x-centerX)*(x-centerX) + (y-centerY)*(y-centerY));

    dir = dir - degrees;

    int origX = (int)(centerX + cos((dir*3.14159265/180)) * mag);
    int origY = (int)(centerY + sin((dir*3.14159265/180)) * mag);
    int offset_original = origX + origY * blockDim.x * gridDim.x;

    if(offset_original > 0 && offset_original < RAW_HEIGHT*RAW_WIDTH)
        *(rotated + offset_rotated) = *(original + offset_original);
}

but it doesn't give me the same result as the CPU version. I think the problem is in how I pass arguments to the CUDA kernel. I am passing them as 2D arrays; is this OK? Can someone explain this to me? Here is my kernel configuration and call:

dim3 BlockPerGrid(450,400,1);
dim3 ThreadsPerGrid(8,4,1);

cudaMalloc((void**)&dev_original_image,sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH);
cudaMalloc((void**)&dev_rotated_image,sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH);

cudaMemcpy(dev_original_image, raw_image2D, sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH,cudaMemcpyHostToDevice);
cudaMemcpy(dev_rotated_image, raw_image2D_rotated, sizeof(unsigned char)*RAW_HEIGHT*RAW_WIDTH, cudaMemcpyHostToDevice);

rotatePicture<<<BlockPerGrid,ThreadsPerGrid>>>(dev_original_image,dev_rotated_image, deg);
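
On the question of passing 2D arrays: a statically sized array such as unsigned char a[RAW_HEIGHT][RAW_WIDTH] is stored contiguously in row-major order, so copying RAW_HEIGHT*RAW_WIDTH bytes and indexing through a flat unsigned char * works, provided the row stride used in the index is the image width. The sketch below is host-side C++ with no CUDA calls. It assumes raw_image2D is such a static array, which the question does not show, and flatIndex and readFlat are hypothetical names:

```cpp
#include <cstddef>

// Offset of element (row, col) in a row-major image with `width` elements per row.
inline std::size_t flatIndex(int row, int col, int width) {
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(width)
         + static_cast<std::size_t>(col);
}

// Reads (row, col) through a flat pointer, as a kernel taking `unsigned char *` would.
inline unsigned char readFlat(const unsigned char *base, int row, int col, int width) {
    return base[flatIndex(row, col, width)];
}
```

In the kernel above the stride is blockDim.x * gridDim.x, which is 8 * 450 = 3600 for the launch shown, not the 4800-byte row width, so rows after the first are read from and written to the wrong offsets.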

Thank you for your advice!

NOTE: I modified my code and it is working better, but it is still not correct.

Are you doing [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all your cuda calls? What are the sizes of `RAW_HEIGHT` and `RAW_WIDTH` ? How about your definition of `dev_original_image` and `dev_rotated_image`, are those just `unsigned char *` or some other type? — Robert Crovella, Feb 18 '13 at 18:16
The code should not compile. `dev_original_image` and `dev_rotated_image` have single level of indirection while the arguments of the kernel have 2 levels of indirection. — sgarizvi, Feb 18 '13 at 18:16
@Robert Crovella I added the lines you specified. @sgar91 It compiles, but it is not working as I imagined. — Antun Tun, Feb 18 '13 at 18:27
You may be interested in [this question/answer](http://stackoverflow.com/questions/12924155/sending-3d-array-to-cuda-kernel/12925014#12925014). The first code sample I give shows how one might pass a multidimensional array to a cuda kernel. — Robert Crovella, Feb 18 '13 at 18:27
@Robert Can you please tell me whether I need to use cudaMallocPitch() with cudaMemcpy2D(), or is this way fine? And where can I learn about cudaMallocPitch()/cudaMemcpy2D()? Thanks — Antun Tun, Feb 18 '13 at 18:30
You don't need to use cudaMallocPitch/cudaMemcpy2D. You should be able to get it working this way. If you use cudaMallocPitch/cudaMemcpy2D it will complicate your array pointer calculations, but may have the opportunity for better coalescing (faster performance). The reference docs are a good place to learn about [cudaMallocPitch](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g80d689bc903792f906e49be4a0b6d8db) and [cudaMemcpy2D](http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#group__CUDART__MEMORY_1g17f3a55e8c9aef5f90b67cdf22851375). — Robert Crovella, Feb 18 '13 at 18:41
Can you help me with one more thing: is it necessary to use __syncthreads() or cudaDeviceSynchronize() in order to make this work? Thanks — Antun Tun, Feb 18 '13 at 19:02
Since your output array is different than the input array, and you're not making use of shared memory, there should be no need for specific synchronization. I'm a bit puzzled about your thread strategy. It appears that the strategy is that each thread will be responsible for one output element. It appears that your total elements are 4800 (x-horizontal) by 1200 (y-vertical) but you are launching 8*450=3600 threads (x-horizontal) by 4*400=1600 threads (y-vertical) ? — Robert Crovella, Feb 18 '13 at 19:16
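
The mismatch pointed out in the last comment can be checked with a little arithmetic. The following is a host-side C++ sketch, assuming one thread per output element. Dim2, ceilDiv, blocksFor and covers are hypothetical names, with Dim2 standing in for the x and y fields of CUDA's dim3:

```cpp
// Stand-in for the x/y fields of CUDA's dim3 (hypothetical helper type).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up.
inline unsigned ceilDiv(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Smallest grid whose blocks cover a width x height image.
inline Dim2 blocksFor(unsigned width, unsigned height, Dim2 threadsPerBlock) {
    Dim2 blocks = { ceilDiv(width, threadsPerBlock.x),
                    ceilDiv(height, threadsPerBlock.y) };
    return blocks;
}

// True when blocks * threadsPerBlock reaches every pixel of the image.
inline bool covers(Dim2 blocks, Dim2 threadsPerBlock, unsigned width, unsigned height) {
    return blocks.x * threadsPerBlock.x >= width
        && blocks.y * threadsPerBlock.y >= height;
}
```

For a 4800 x 1200 image with 8 x 4 threads per block, blocksFor gives 600 x 300 blocks, while the 450 x 400 grid in the question reaches only 3600 columns. When the image size is not a multiple of the block size, the rounded-up grid launches extra threads, so the kernel should also return early for x >= width or y >= height.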

score 1 · Accepted Answer · answered Feb 18 '13 at 22:04

Here is the solution for others lurking in these waters. This is my corrected kernel:

__global__ void rotatePicture(unsigned char *original, unsigned char *rotated, int degrees)
{
    int x = threadIdx.x + blockDim.x * blockIdx.x;
    int y = threadIdx.y + blockDim.y * blockIdx.y;
    int offset_rotated = x + y * blockDim.x * gridDim.x;

    double centerX = 2400.0;
    double centerY = 600.0;

    double dir = (atan2(x-centerX,y-centerY))*180/3.14159265;
    double mag = sqrt((x-centerX)*(x-centerX) + (y-centerY)*(y-centerY));

    dir = dir - degrees;

    int origX = (int)(centerX + sin((dir*3.14159265/180)) * mag);
    int origY = (int)(centerY + cos((dir*3.14159265/180)) * mag);
    int offset_original = origX + origY * blockDim.x * gridDim.x;

    if(origX >= 0 && origX < RAW_WIDTH && origY >= 0 && origY < RAW_HEIGHT)
        *(rotated + offset_rotated) = *(original + offset_original);
}

Also, I changed the kernel launch dimensions like this (to accommodate my 1600*3 width and 1200 height):

dim3 BlockPerGrid(600,300,1);
dim3 ThreadsPerGrid(8,4,1);

It now works the same way as the CPU version above, but uses GPU resources. Enjoy!
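
For checking the kernel's output on the host, the same per-pixel mapping can be sketched in plain C++. This is not the author's code: it passes the image width explicitly as the row stride instead of relying on blockDim.x * gridDim.x matching the width, and it rounds the source coordinates to the nearest pixel rather than truncating. Both choices are assumptions made for this sketch, and sourcePixel and rotateFlat are hypothetical names:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// Finds the source pixel that lands on destination pixel (x, y) after rotating
// by `degrees`. Returns false when the source falls outside the image.
inline bool sourcePixel(int x, int y, int width, int height, double degrees,
                        int &origX, int &origY) {
    const double centerX = width / 2.0;
    const double centerY = height / 2.0;
    const double dx = x - centerX;
    const double dy = y - centerY;

    // Same argument order as the kernel: the angle is measured from the +y axis.
    const double dir = std::atan2(dx, dy) * 180.0 / kPi - degrees;
    const double mag = std::sqrt(dx * dx + dy * dy);

    // Assumption: round to the nearest pixel instead of truncating.
    origX = static_cast<int>(std::lround(centerX + std::sin(dir * kPi / 180.0) * mag));
    origY = static_cast<int>(std::lround(centerY + std::cos(dir * kPi / 180.0) * mag));
    return origX >= 0 && origX < width && origY >= 0 && origY < height;
}

// Rotates a row-major image; destination pixels with no valid source stay 0.
inline std::vector<unsigned char> rotateFlat(const std::vector<unsigned char> &src,
                                             int width, int height, double degrees) {
    const std::size_t w = static_cast<std::size_t>(width);
    std::vector<unsigned char> dst(src.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int origX = 0;
            int origY = 0;
            if (sourcePixel(x, y, width, height, degrees, origX, origY)) {
                dst[static_cast<std::size_t>(y) * w + static_cast<std::size_t>(x)] =
                    src[static_cast<std::size_t>(origY) * w + static_cast<std::size_t>(origX)];
            }
        }
    }
    return dst;
}
```

Comparing rotateFlat against the kernel output copied back with cudaMemcpy is one way to check that the two agree. Small differences along pixel boundaries are to be expected because of the rounding choice.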
