I'm not sure I would focus on 2D arrays if you're just starting to learn CUDA.
I'm also curious whether you manually typed your code into the question, because you define a threads_per_block variable but then use threads_per_blocks in the kernel invocation.
Anyway, there are several problems with your code:
- When using 2D arrays, it's almost always necessary to pass the pitch parameter (in some fashion) to the kernel. cudaMallocPitch allocates arrays with extra padding on the end of each row, so that the next row starts at a nicely aligned boundary. This usually results in an allocation granularity of 128 or 256 bytes. So your first row has 3 valid data elements followed by enough empty space to pad the row out to, say, 256 bytes (equal to whatever your pitch variable is). We therefore have to change the kernel invocation and the kernel itself to account for this (see the sketch after this list).
- Your kernel is inherently a 1D kernel (it does not comprehend or use threadIdx.y, for example), so there's no point in launching a 2D grid. Although it doesn't hurt anything in this case, it creates redundancy that can be confusing and troublesome in other codes.
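To illustrate the pitch handling, here is a minimal sketch (not your code; the kernel name, bounds check, and 2D indexing are made up for illustration) of the usual way to address a pitched allocation from a kernel that does use a 2D thread index:

// Sketch only: addressing a pitched allocation with a 2D thread index.
// pitch is the byte pitch returned by cudaMallocPitch for data.
__global__ void touchPitched(float* data, size_t pitch, int width, int height)
{
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < height && col < width)
  {
    // rows are pitch bytes apart: step to the start of the row in bytes,
    // then index within the row in elements
    float* row_start = (float*)((char*)data + row * pitch);
    row_start[col] = 0.0f;
  }
}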
Here's updated code showing some changes that will give you the expected results, based on the above comments:
#include <stdio.h>

__global__ void doStuff(float* data, float* result, size_t dpitch, size_t rpitch, int width)
{
  if (threadIdx.x < 9) // take the first 9 threads
  {
    int index = threadIdx.x;
    // rpitch is in bytes; divide by sizeof(float) to index in elements
    result[((index/width)*(rpitch/sizeof(float))) + (index%width)] = (float) index;
  }
}

int main(void)
{
  /*
    Setup
  */
  float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};
  float* data_array;
  float* result_array;
  size_t data_array_pitch, result_array_pitch;
  int height = 3;
  int width = 3;
  int width_in_bytes = width * sizeof(float);

  /*
    Initialize GPU arrays
  */
  cudaMallocPitch((void**)&data_array, &data_array_pitch, width_in_bytes, height);
  cudaMallocPitch((void**)&result_array, &result_array_pitch, width_in_bytes, height);

  /*
    Copy data to GPU
  */
  cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);

  dim3 threads_per_block(16);
  dim3 num_blocks(1,1);

  /*
    Do stuff
  */
  doStuff<<<num_blocks, threads_per_block>>>(data_array, result_array, data_array_pitch, result_array_pitch, width);

  /*
    Get the results
  */
  cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);

  for (int i = 1; i <= 9; ++i)
  {
    printf("%f ", simple[i-1]);
    if (!(i%3))
      printf("\n");
  }

  return 0;
}
You might also find this question interesting reading.
EDIT: responding to a question in the comments:
result[((index/width)*(rpitch/sizeof(float))) + (index%width)] = (float) index;

The three pieces of this index calculation, (index/width), (rpitch/sizeof(float)), and (index%width), correspond to steps 1, 2, and 3 below.
To compute the correct element index into the pitched array we must:
- Compute the (virtual) row index from the thread index. We do this by taking integer division of the thread index by the width of each (non-pitched) row (in elements, not bytes).
- Multiply the row index by the width of each pitched row. The width of each pitched row is given by the pitch parameter (rpitch), which is in bytes. To convert this byte pitch into a pitch in elements, we divide by the size of each element. Multiplying that quantity by the row index computed in step 1 brings us to the start of the correct row.
- Compute the (virtual) column index from the thread index by taking the remainder (modulo) of the thread index divided by the width (in elements). Once we have the column index (in elements) we add it to the start-of-row index computed in step 2, to identify the element that this thread will be responsible for (a worked example follows this list).
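For concreteness, here is that arithmetic worked through with assumed values (index = 5, width = 3, rpitch = 256 bytes; the actual pitch on your hardware may differ):

// assumed values: index = 5, width = 3, rpitch = 256 (bytes), sizeof(float) = 4
int row            = index / width;              // 5 / 3 = 1
int pitch_in_elems = rpitch / sizeof(float);     // 256 / 4 = 64
int col            = index % width;              // 5 % 3 = 2
int elem_offset    = row * pitch_in_elems + col; // 1 * 64 + 2 = 66
// so thread 5 writes result[66], i.e. row 1, column 2 of the pitched result array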
The above is a fair amount of effort for a relatively straightforward operation, which is one example of why I suggest focusing on basic CUDA concepts rather than pitched arrays first. For example, I would figure out how to handle 1D and 2D thread blocks, and 1D and 2D grids, before tackling pitched arrays. Pitched arrays are a useful performance enhancer for accessing 2D (or 3D) arrays in some instances, but they are by no means necessary for handling multidimensional arrays in CUDA.
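For comparison, a 2D array can be handled with no pitch at all by flattening it into a single cudaMalloc allocation and indexing with row*width + col. A minimal sketch of that idea (the kernel name and example are arbitrary, not taken from your code):

// Sketch: a flattened (non-pitched) 2D array; kernel name and sizes are arbitrary.
__global__ void copyFlat(const float* in, float* out, int width, int height)
{
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  if (row < height && col < width)
    out[row * width + col] = in[row * width + col]; // ordinary row-major indexing
}

// host side: one contiguous allocation, plain cudaMemcpy, no cudaMemcpy2D needed
// float *d_in, *d_out;
// cudaMalloc((void**)&d_in,  width * height * sizeof(float));
// cudaMalloc((void**)&d_out, width * height * sizeof(float));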