Cuda/cudafy 3d indexing

Question

I'm trying to get my head around CUDA. After not grasping similar Stack Overflow questions, I decided to test out an example (I'm using CUDAfy.NET for C#, but the underlying CUDA should be parsable).

I want to do the following: send a 4x4x4 matrix to the kernel and get a 4x4x4 matrix out according to this logic:

if(input[x,y,z] == 1)
    output[x+1, y, z]++;

if(input[x,y,z] == 2)
    output[x-1, y, z]++;

I studied the following CUDAfy example.

public const int N = 1 * 1024;

//Omissions

gpu.Launch(128, 1, function, dev_a, dev_b, dev_c);

kernel:

[Cudafy]
public static void add_0(GThread thread, int[] a, int[] b, int[] c)
{
    int tid = thread.blockIdx.x; // (tid 0 -> 127, from my understanding)
    while (tid < N)
    { 
        c[tid] = a[tid] + b[tid];
        tid += thread.gridDim.x;
    }
}

I then tried to transfer it to 3D, but I cannot get the indexing right. Say I have the following (three arrays here, just to test indexing):

int size = 4;
int[] dev_delta = gpu.Allocate<int>(size * size * size);
int[] dev_space = gpu.Allocate<int>(size * size * size);
int[] dev_result = gpu.Allocate<int>(size * size * size);

gpu.Launch(new dim3(4, 4, 4), 1, "testIndex", dev_delta, dev_space, dev_result);

And the kernel:

[Cudafy]
public static void testIndex(GThread thread, int[] delta, int[] space, int[] result)
{
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;
    int z = thread.blockIdx.z;
    delta[x]++;
    space[y]++;
    result[z]++;
}

Naively I'd expect the following:

delta = {4,4,4,4,0,0,0,0,0, ... 0,0}
space = {4,4,4,4,0,0,0,0,0, ... 0,0}
result = {4,4,4,4,0,0,0,0,0 ... 0,0}

But I get:

delta = {1,1,1,1,0,0,0,0,0, ... 0,0}
space = {1,1,1,1,0,0,0,0,0, ... 0,0}
result = {1,0,0,0,0,0,0,0,0 ... 0,0}

This makes no sense to me; clearly I am missing something.

Questions:

How many threads am I starting?

How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])?

How do you go about 'indexing' my example problem in 2 dimensions? (Starting 4x4 threads and then let each thread handle a depth column of length 4)

I found this, which may be relevant: "Why is z always zero in CUDA kernel". Even if that is what is messing me up, I'd still appreciate pure-CUDA answers to sort my brain out.

score 2 · Accepted Answer · edited May 23 '17 at 12:34

How many threads am I starting? You are starting 1 thread per block, hence 16 threads in total, since the Z parameter is not used. For better performance, I would recommend also using several threads per block (at least 128, and in any case a multiple of 32).
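The count above is just blocks in the grid times threads per block. A minimal host-side sketch of that arithmetic in plain C++ follows; `Dim3` and `total_threads` are hypothetical names for illustration and are not part of CUDA or CUDAfy.NET.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 (illustration only, not the real type).
struct Dim3 {
    unsigned x, y, z;
};

// Threads started by a launch = blocks in the grid * threads in each block.
unsigned long long total_threads(Dim3 grid, Dim3 block) {
    // Every dimension must be at least 1 for a launch to make sense.
    assert(grid.x && grid.y && grid.z && block.x && block.y && block.z);
    unsigned long long blocks = 1ULL * grid.x * grid.y * grid.z;
    unsigned long long threads_per_block = 1ULL * block.x * block.y * block.z;
    return blocks * threads_per_block;
}
```

With the grid Z dimension forced to 1, `total_threads(Dim3{4, 4, 1}, Dim3{1, 1, 1})` gives 16, whereas the intended 4x4x4 grid with one thread per block would give 64.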

How do you go about 'indexing' my example problem in 3 dimensions (Starting 4x4x4 threads and getting the variables for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])? The second parameter of the gpu.Launch method sets the number of threads per block, so x, y and z could be threadIdx.x, threadIdx.y and threadIdx.z respectively. If you also want to use many blocks, threadIdx.x + blockDim.x * blockIdx.x would be a good pick for x (and likewise for y and z).
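The two formulas in this paragraph can be checked on the CPU. The sketch below is plain C++, not a CUDA kernel: the loops stand in for one block of 4x4x4 threads, and the function names are hypothetical. It assumes the row-major layout from the question's formula.

```cpp
#include <cassert>
#include <vector>

// Global index along one axis: threadIdx + blockDim * blockIdx.
int global_index(int thread_idx, int block_dim, int block_idx) {
    return thread_idx + block_dim * block_idx;
}

// Flattening from the question: x * sizeY * sizeZ + y * sizeZ + z.
int flat_index(int x, int y, int z, int size_x, int size_y, int size_z) {
    assert(0 <= x && x < size_x);
    assert(0 <= y && y < size_y);
    assert(0 <= z && z < size_z);
    return x * size_y * size_z + y * size_z + z;
}

// Emulates a single block of size^3 threads sequentially and counts
// how often each flat cell is visited.
std::vector<int> visit_counts(int size) {
    std::vector<int> hits(size * size * size, 0);
    for (int tz = 0; tz < size; ++tz) {
        for (int ty = 0; ty < size; ++ty) {
            for (int tx = 0; tx < size; ++tx) {
                // With a single block, blockIdx is 0 on every axis.
                int x = global_index(tx, size, 0);
                int y = global_index(ty, size, 0);
                int z = global_index(tz, size, 0);
                ++hits[flat_index(x, y, z, size, size, size)];
            }
        }
    }
    return hits;
}
```

For a size of 4, every one of the 64 cells is visited exactly once, which is the property a 3D kernel needs before it can safely give each thread its own cell.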

The link you provided explains why your Z dimension has no effect. CUDAfy.NET exposes a launch function that in turn calls the CUDA runtime C API. When passing parameters from .NET to the native environment, it seems that CUDAfy.NET simply ignores the Z argument, leaving it at one (most probably because early versions of CUDA did not support a grid Z dimension other than one). The explanation is not pure CUDA, because CUDA itself now supports Z values other than one; your parameter is simply ignored in the CUDAfy.NET implementation.
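If the grid Z dimension is unavailable, one workaround is the 2D variant from the question: start one thread per (x, y) and let each walk its z column. The sketch below emulates that on the CPU in plain C++ with hypothetical names. It makes three assumptions that are not in the question: writes that would fall outside the 4x4x4 volume are dropped, the output starts at zero, and the update is rewritten as a gather (each cell reads its two x neighbours) so that no two threads write the same cell.

```cpp
#include <cassert>
#include <vector>

// Row-major flat index for a size x size x size cube.
int flat(int x, int y, int z, int size) {
    return (x * size + y) * size + z;
}

// Body of one emulated thread at (x, y): it handles the whole z column.
// Gather form of the question's rule: a cell gains 1 if its left neighbour
// holds 1 (pushing to x+1) and 1 if its right neighbour holds 2 (pushing to x-1).
void column_kernel(int x, int y, int size,
                   const std::vector<int>& input, std::vector<int>& output) {
    for (int z = 0; z < size; ++z) {
        int count = 0;
        if (x > 0 && input[flat(x - 1, y, z, size)] == 1) ++count;
        if (x + 1 < size && input[flat(x + 1, y, z, size)] == 2) ++count;
        output[flat(x, y, z, size)] = count;
    }
}

// Emulates a size x size launch, one thread per (x, y) pair.
std::vector<int> run_columns(const std::vector<int>& input, int size) {
    assert(input.size() == static_cast<std::size_t>(size) * size * size);
    std::vector<int> output(input.size(), 0);
    for (int x = 0; x < size; ++x) {
        for (int y = 0; y < size; ++y) {
            column_kernel(x, y, size, input, output);
        }
    }
    return output;
}
```

The literal scatter form from the question (`output[x+1, y, z]++`) would let the threads at x-1 and x+1 update the same cell, so on a real GPU it would need atomic increments; the gather form avoids that at the cost of two reads per cell.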

But 16 threads does not explain my results. I assume each thread runs my kernel; then `delta[x]++;` should be run 16 times and the sum of the elements in that array should equal 16. Have I done something wrong, is CUDAfy broken, or have I missed something about memory management on the device? Might my delta[x]++; calls be overwriting each other? — Adam, May 17 '17 at 12:38
@Adam, since increments are neither atomic nor protected against data races, it is likely that all threads read zero, increment the value in a register, and write one to the output. Note that with one thread per block, several blocks may run in parallel at exactly the same time. Also, the latency of a read/write is such that this pattern is very likely to occur. — Florent DUGUET, May 17 '17 at 13:30
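The lost-update pattern described in this comment can be reproduced deterministically on the CPU. The sketch below is plain C++ with hypothetical names, not GPU code. It assumes the worst-case interleaving, where every block reads the old value before any block writes, and compares it with increments applied one at a time, which is what an atomic add such as CUDA's `atomicAdd` would guarantee.

```cpp
#include <cassert>
#include <vector>

// Emulates one `array[index]++` per thread.
// serialized == false: every thread reads the pre-launch value, then writes it + 1.
// serialized == true:  increments are applied one after another, as with an atomic add.
std::vector<int> apply_increments(const std::vector<int>& targets,
                                  int length, bool serialized) {
    std::vector<int> before(length, 0);  // array contents before the launch
    std::vector<int> after(length, 0);
    for (int index : targets) {
        assert(0 <= index && index < length);
        if (serialized) {
            after[index] += 1;
        } else {
            after[index] = before[index] + 1;  // stale read: updates are lost
        }
    }
    return after;
}

// Index touched by each block of a grid_x x grid_y launch (grid Z forced to 1).
// axis 0 -> blockIdx.x, axis 1 -> blockIdx.y, axis 2 -> blockIdx.z (always 0).
std::vector<int> block_targets(int grid_x, int grid_y, int axis) {
    std::vector<int> targets;
    for (int by = 0; by < grid_y; ++by) {
        for (int bx = 0; bx < grid_x; ++bx) {
            targets.push_back(axis == 0 ? bx : (axis == 1 ? by : 0));
        }
    }
    return targets;
}
```

Under these assumptions, the racy version of a 4x4 launch yields {1,1,1,1,0,...} for delta and space and {1,0,0,...} for result, matching the output in the question. The serialized version yields {4,4,4,4,0,...} for delta and space and {16,0,0,...} for result, so with 16 blocks even atomic increments would not give the {4,4,4,4} the question expected for result.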
