Any time you're having trouble with CUDA code, it's a good idea to use proper CUDA error checking and run your code with cuda-memcheck. Even if you don't understand the error output, it will be useful for those trying to help you, so I suggest doing that before asking for help here.
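If you don't already have an error-checking pattern, a minimal sketch looks something like this (the macro name is my own choice; any equivalent wrapper is fine):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// minimal CUDA error-checking macro (the name is arbitrary)
#define cudaCheck(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// typical usage around a kernel launch:
//   aSimpleKernel<<<1,1>>>(5, d_payloads);
//   cudaCheck(cudaGetLastError());        // catches launch/configuration errors
//   cudaCheck(cudaDeviceSynchronize());   // catches errors during kernel execution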
My attempt to make a complete code out of what you have shown looks like this:
#include <stdio.h>
#include <string.h>

typedef unsigned char uint40[5];

void aSimpleFunction(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        // reinterpret the first 4 bytes of each element as an unsigned int
        unsigned int* dataPtr = (unsigned int*)data[i];
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

void uint40_print(uint40 &data){
    char *my_data = (char *)&data;
    for (int i = 0; i < 5; i++) printf("%d", my_data[i]);
    printf("\n");
}

__global__ void aSimpleKernel(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        printf("%p\n", dataPtr);
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

int main(){
    uint40 *payloads = (uint40 *)malloc(10000);
    memset(payloads, 0, 10000);
    aSimpleFunction(5, payloads);
    uint40_print(payloads[0]);
    memset(payloads, 0, 10000);
    uint40 *d_payloads;
    cudaMalloc(&d_payloads, 10000);
    aSimpleKernel<<<1,1>>>(5, d_payloads);
    cudaMemcpy(payloads, d_payloads, 10000, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; i++) uint40_print(payloads[i]);
    return 0;
}
When I compile and run that code I get output like this:
$ ./t1091
22221
00000
$
Sure enough, the GPU output doesn't match the CPU output. If I run the code with cuda-memcheck, a portion of the output I get looks like this:
$ cuda-memcheck ./t1091
========= CUDA-MEMCHECK
22221
========= Invalid __global__ write of size 4
========= at 0x00000080 in /home/bob/misc/t1091.cu:28:aSimpleKernel(int, unsigned char[5]*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x402500005 is misaligned
And this gives a clue to the actual problem. In effect you are creating a char array and then superimposing a 5-byte-wide structure (uint40) on it. This means that consecutive uint40 items will start at byte addresses that differ by 5.
When you take one of these addresses and cast it to an int or unsigned int pointer, you may end up with a misaligned pointer. CUDA requires all accesses of POD data types to occur on naturally aligned boundaries. So a 32-bit quantity (e.g. int, float, etc.) must be accessed on a 4-byte boundary (0, 4, 8, ...). Many of the 5-byte boundaries for uint40 (0, 5, 10, ...) don't also fall on 4-byte boundaries, so attempting to access a 4-byte quantity at such an address is illegal.
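To make the arithmetic concrete, here is a small host-only sketch (my own toy code, not part of your program) that prints which uint40 start offsets would be legal for a 4-byte access:

#include <stdio.h>

typedef unsigned char uint40[5];

int main(){
    // consecutive uint40 elements start at byte offsets 0, 5, 10, 15, 20, ...
    for (int i = 0; i < 5; i++){
        size_t offset = i * sizeof(uint40);
        // a 4-byte int/unsigned int access is only legal on a 4-byte boundary
        printf("element %d starts at offset %zu -> %s\n", i, offset,
               (offset % 4 == 0) ? "aligned" : "misaligned");
    }
    return 0;
}

This is consistent with the cuda-memcheck output above: the faulting address ends in ...5, i.e. 5 bytes past an aligned allocation, exactly the first misaligned element.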
One possible solution, for this particular usage example, and assuming the pointer you pass to the kernel is one returned by cudaMalloc (which guarantees alignment), is simply to change your typedef:
typedef unsigned char uint40[8];
This forces every uint40 item to fall on an 8-byte boundary, which is also a 4-byte boundary. A side effect of this would be allocating 3 unused bytes out of every 8 allocated.
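If you go that route, the main thing to watch is that allocation sizes are computed per element rather than assuming 5 bytes each; a hedged sketch of the adjusted setup (the element count N is mine, chosen arbitrarily):

typedef unsigned char uint40[8];   // 5 payload bytes + 3 bytes of padding

const int N = 1250;
uint40 *payloads = (uint40 *)malloc(N * sizeof(uint40));
uint40 *d_payloads;
cudaMalloc(&d_payloads, N * sizeof(uint40));
// the kernel and host code can otherwise stay as they are; every data[i]
// now begins at a multiple of 8 bytes, so the unsigned int cast is legal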
In your case, you indicated that the uint40 type was a collection of data, not a single numerical quantity, so it is effectively a data "structure" that happens to occupy 5 bytes per element. An array of such "structures" would effectively be AoS (array of structures) storage format, and a common transformation on such data for performance is to convert it to an SoA (structure of arrays) storage format. Therefore another possible approach would be to create two arrays:
typedef unsigned char uint40a[4];
typedef unsigned char uint40b[1];

uint40a *data1;   // first 4 bytes of each element
uint40b *data2;   // 5th byte of each element
cudaMalloc(&data1, N * sizeof(uint40a));
cudaMalloc(&data2, N * sizeof(uint40b));
and access your data in this fashion; a sketch follows below. This will maintain the storage density and almost certainly provide faster access to your data on the GPU as compared to your 5-byte structure.
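As an example of what accessing the split arrays could look like, here is a kernel sketch (the kernel name and the particular fill values just mirror your example; this is illustrative, not the only way to do it):

// uses the uint40a / uint40b typedefs from above
__global__ void aSimpleKernelSoA(int M, uint40a* data1, uint40b* data2)
{
    for (int i = 0; i < M; i++)
    {
        // data1[i] begins at byte offset 4*i from the cudaMalloc'ed base,
        // so reinterpreting it as an unsigned int is properly aligned
        unsigned int* dataPtr = (unsigned int*)data1[i];
        *dataPtr = 0x02020202U;
        data2[i][0] = 1;   // the 5th byte of each element lives in its own array
    }
}

The 4-byte elements of data1 also line up with what neighboring threads would touch if the loop were later parallelized, which is where the access-speed benefit mentioned above comes from.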
If there is any doubt from the above: you cannot pick up an arbitrary char pointer, cast it to another (larger) datatype, and expect good things to happen. The pointers you use must be properly aligned for the datatype being referenced.