CUDA: 2D array indexing giving unexpected results

Question

I started learning CUDA, and I wanted to write a simple program that copied some data to the GPU, modified it, and transferred it back. I've already googled around and tried to find my mistake. I'm pretty sure that the problem is in my kernel, but I'm not completely sure what is wrong.

Here's my kernel:

__global__ void doStuff(float* data, float* result)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[index] = (float) index;
    }
}

And here are the relevant parts of my main:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};

    float* data_array;
    float* result_array;

    size_t data_array_pitch, result_array_pitch;
    int width_in_bytes = 3 * sizeof(float);
    int height = 3;

    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);

    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);

    dim3 threads_per_block(16, 16);
    dim3 num_blocks(1,1);

    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_blocks>>>(data_array, result_array);

    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);

    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }

    return 0;
}

When I run this I get 0.000000 1.000000 2.00000 for the first row and garbage for the other two.

If you do [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on all cuda API calls and kernel calls, do you get any errors? What happens when you run your code with `cuda-memcheck` ? — Robert Crovella, Jun 03 '13 at 21:16
Do I need to account for the pitch when accessing elements in the array? I'm looking at page 30 of NVIDIA's guide now. — , Jun 03 '13 at 21:26

score 3 · Accepted Answer · edited May 23 '17 at 12:28

I'm not sure I would focus on 2D arrays if you're just starting to learn cuda.

Also curious if you manually typed your code into the question, because you have a threads_per_block variable defined, but then you use threads_per_blocks in the kernel invocation.

Anyway, there are several problems with your code:

when using 2D arrays, it's almost always necessary to pass the pitch parameter (in some fashion) to the kernel. cudaMallocPitch allocates arrays with extra padding on the end of each row, so that the next row starts at a nicely aligned boundary. This will usually result in allocation granularity of 128 or 256 bytes. So your first row has 3 valid data entities followed by enough empty space to fill up, say 256 bytes (equal to whatever your pitch variable is). So we have to change the kernel invocation and the kernel itself to account for this.
Your kernel is inherently a 1D kernel (it does not comprehend or use threadIdx.y, for example). Therefore there's not point in launching a 2D grid. Although it doesn't hurt anything in this case, it's creating redundancy which can be confusing and troublesome in other codes.

Here's an updated code showing some changes that will give you expected results, based on the above comments:

#include <stdio.h>


__global__ void doStuff(float* data, float* result, size_t dpitch, size_t rpitch, int width)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
    }
}

int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};

    float* data_array;
    float* result_array;

    size_t data_array_pitch, result_array_pitch;
    int height = 3;
    int width = 3;
    int width_in_bytes = width * sizeof(float);

    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);

    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);

    dim3 threads_per_block(16);
    dim3 num_blocks(1,1);

    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_block>>>(data_array, result_array, data_array_pitch, result_array_pitch, width);

    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);

    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }
    return 0;
}

You might also find this question interesting reading.

EDIT: responding to a question in the comments:

result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
              1               2                      3

To compute the correct element index into the pitched array we must:

Compute the (virtual) row index from the thread index. We do this by taking integer division of the thread index by the width of each (non-pitched) row (in elements, not bytes).
Multiply the row index by the width of each pitched row. The width of each pitched row is given by the pitched parameter, which is in bytes. To convert this pitched byte parameter into a pitched element parameter, we divide by the size of each element. Then by multiplying the quantity by the row index computed in step 1, we have now indexed into the correct row.
Compute the (virtual) column index from the thread index by taking the remainder (modulo division) of the thread index divided by the width (in elements). Once we have the column index (in elements) we add it to the start-of-the-correct-row index computed in step 2, to identify the element that this thread will be responsible for.

The above is a fair amount of effort for a relatively straightforward operation, which is one example of why I suggest focusing on basic cuda concepts rather than pitched arrays first. For example I would figure how to handle 1 and 2D thread blocks, and 1 and 2D grids, before tackling pitched arrays. Pitched arrays are a useful performance enhancer for accessing 2D arrays (or 3D arrays) in some instances, but they are by no means necessary to handle multidimensional arrays in CUDA.

Um...yes, I did type it manually. Sorry about that. Can you explain the `result[...]` line in a little bit more detail? — , Jun 03 '13 at 21:47

score 0 · Answer 2 · answered Jun 03 '13 at 22:09

Actually it can also be done by replacing the line

int width_in_bytes = 3 * sizeof(float);

by:

int width_in_bytes = sizeof(float)*9;

because this is the parameter that tells cudaMemcpy2D how many bytes to copy from src to dst, in the first code you ask to copy 3 float numbers, but the array you want to copy has length 9, so the width you require is the size of 9 float numbers.

Although this solution works there are still some inefficiencies in your code; for example if you really want that the first 9 threads of the block do something, in the 'if' you should add the following condition with an and(&&)

threadIdx.y==0

CUDA: 2D array indexing giving unexpected results

2 Answers2

Linked