1

I have a type defined as typedef unsigned char uint40[5] and then I have an array of uint40, say uint40* payloads

I was trying to port the following function into a CUDA kernel

void aSimpleFunction(int M, uint40* data)
{
    for (auto i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        *dataPtr = 2158677232;
        data[i][4] = 1;
    }
}

To me it was as simple as but it didn't work. However, using square-brackets to access to each of the elements it does work.

__global__ void aSimpleKernel(int M, uint40* data)
{
    int tid = threadIdx.x + 1;

    // DOESN'T WORK
    unsigned int* dataPtr = (unsigned int*)data[tid];
    *dataPtr = 16976944;
    // WORKS
    /*
    data[threadIdx.x][0] = tid * 1;
    data[threadIdx.x][1] = tid * 2;
    data[threadIdx.x][2] = tid * 3;
    data[threadIdx.x][3] = tid * 4;
    */
    data[threadIdx.x][4] = 2;
}

Is it possible to cast a char* into a unsigned int* in a CUDA kernel?

By "didn't work" I mean, it has random numbers instead of what I really expect when printing each of the elements of the uint40* array. Sometimes, the GPU apparently crashes since there is a pop up in windows telling me the gpu restarted successfully.

BRabbit27
  • 6,333
  • 17
  • 90
  • 161
  • Do you mean that `data[tid]` is not a `char[5]` but a `char`? I might be missing something because in the CPU-version it works perfect but then something I'm doing weird when going to GPU. Any advice? – BRabbit27 Feb 26 '16 at 21:51
  • I think I got confused in the levels of indirection in my original comment. You may want to post MCVE since the error could be in the way `data` is initialized. – void_ptr Feb 26 '16 at 21:52
  • Yeah, I imagined that there was something subtle between GPU and CPU way of accessing stuff. The thing is I want to pack information is as few bites as possible to optimize memory usage. Before I was OK with an `int` but project is changing and need 8 more bits to store other info but we didn't want to go to a bigger type. – BRabbit27 Feb 26 '16 at 22:40

1 Answers1

6

Any time you're having trouble with a CUDA code, it's a good idea to use proper cuda error checking and run your code with cuda-memcheck. Even if you don't understand the error output, it will be useful for those trying to help you, so I suggest doing that before asking for help here.

My attempt to make a complete code out of what you haven shown was like this:

#include <stdio.h>

typedef unsigned char uint40[5];


void aSimpleFunction(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

void uint40_print(uint40 &data){

  char *my_data = (char *)&data;
  for (int i = 0; i < 5; i++) printf("%d", my_data[i]);
  printf("\n");
}

__global__ void aSimpleKernel(int M, uint40* data)
{
    for (int i = 0; i < M; i++)
    {
        unsigned int* dataPtr = (unsigned int*)data[i];
        printf("%p\n", dataPtr);
        *dataPtr = 0x02020202U;
        data[i][4] = 1;
    }
}

int main(){

  uint40 *payloads = (uint40 *)malloc(10000);
  memset(payloads, 0, 10000);
  aSimpleFunction(5, payloads);
  uint40_print(payloads[0]);
  memset(payloads, 0, 10000);
  uint40 *d_payloads;
  cudaMalloc(&d_payloads, 10000);
  aSimpleKernel<<<1,1>>>(5, d_payloads);
  cudaMemcpy(payloads, d_payloads, 10000, cudaMemcpyDeviceToHost);
  for (int i = 0; i < 5; i++) uint40_print(payloads[i]);
  return 0;
}

When I compile and run that code I get output like this:

$ ./t1091
22221
00000
$

sure enough, the GPU output doesn't match the CPU output. If I run the code with cuda-memcheck, a portion of the output I get looks like this:

$ cuda-memcheck ./t1091
========= CUDA-MEMCHECK
22221
========= Invalid __global__ write of size 4
=========     at 0x00000080 in /home/bob/misc/t1091.cu:28:aSimpleKernel(int, unsigned char[5]*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x402500005 is misaligned

And this gives a clue to the actual problem. In effect you are creating a char array and then superimposing a 5-byte-wide structure (uint40) on it. This means that consecutive uint40 items will start at byte addresses that differ by 5.

When you take one of these addresses and cast it to a int or unsigned int pointer, you may end up with a misaligned pointer. CUDA requires all accesses of POD data types to occur on naturally aligned boundaries. So a 32-bit quantity (e.g. int, float, etc.) must be accessed on a 4-byte boundary (0, 4, 8, ...). Many of the 5-byte boundaries for uint40 (0, 5, 10, ...) don't also fall on 4-byte boundaries, so attempting to access a 4-byte quantity that way is illegal.

One possible solution, for this particular usage example, and assuming the pointer you pass to the kernel is a pointer that is returned by cudaMalloc (for alignment), is just to change your typedef:

typedef unsigned char uint40[8];

This forces every uint40 item to fall on an 8-byte boundary, which is also a 4-byte boundary. A side effect of this would be allocating 3 unused bytes out of every 8 allocated.

In your case, you indicated that the uint40 type was a collection of data, not a single numerical quantity, so it is effectively a data "structure" that happens to occupy 5 bytes per element. An array of such "structures" would effectively be AoS (array of structures) storage format, and a common transformation on such data for performance is to convert it to an SoA (structure of arrays) storage format. Therefore another possible approach would be to create two arrays:

typedef unsigned char uint40a[4];
typedef unsigned char uint40b[1];
uint40a *data1;
uint40b *data2;
cudaMalloc(&data1, size);
cudaMalloc(&data2, size);

and access your data in this fashion. This will maintain the storage density and almost certainly provide faster access to your data in the GPU as compared to your 5-byte structure.

If there is any doubt from the above, you cannot pick up an arbitrary char pointer, cast it to another (larger) datatype, and expect good things to happen. The pointers you use must be properly aligned for the datatype being referenced.

Community
  • 1
  • 1
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • I don't see how `typedef unsigned char uint40[8];` helps. The alignment requirement for an array is equal to the alignment requirement for the element type, so an instance of `uint40` could still be be allocated at, say, address `0x1`. If the compiler lets you add an alignment attribute to array declarations that would help. Not sure whether that is allowed, I am only familiar with use of alignment attributes for `struct`. – njuffa Feb 27 '16 at 20:42
  • I've made a clarifying statement. I assume that the pointer passed to the kernel is a pointer returned by `cudaMalloc` (and not some modified version of it). In that case, I fail to see that there is an issue, and my testing shows that it rectifes the misalignment issue, according to `cuda-memcheck` and numerical testing. – Robert Crovella Feb 27 '16 at 21:08
  • I've updated the sample code in my answer to provide some extra printout. I'm not sure where you think the issue is. – Robert Crovella Feb 27 '16 at 21:27
  • I am coming from a "language specification guarantee" angle: Unless things have changed since I last looked (or I am misunderstanding something fundamental), there is no *guarantee* that the CUDA compiler allocates an `unsigned char uint40[8]` on an 8-byte boundary. Your answer seems to assume otherwise. So I fear this may be a case of "happens to work" not "guaranteed to work". To make predefined `struct`s like `float2` align as desired, the compiler uses alignment attributes. I think this is required for any composite type to be aligned stricter than underlying element type. – njuffa Feb 27 '16 at 21:50
  • I think I get it now. Your answer is based on "assuming the pointer you pass to the kernel is a pointer that is returned by cudaMalloc". That makes it work, and may be easy to do for a simple kernel where one can track where each pointer originated, but it doesn't address the general case, which requires checking alignments, and handling end-cases. – njuffa Feb 27 '16 at 22:01
  • I agree that you cannot pick up an arbitrary `char` pointer and make it work, even if you overlay some kind of `uint40[8]` structure on it. I think I said more or less that in my answer. – Robert Crovella Feb 27 '16 at 22:30