I'm trying to get my head around CUDA. After failing to grasp similar Stack Overflow questions, I decided to test an example. (I'm using CUDAfy.NET for C#, but the underlying CUDA should be recognizable.)
I want to send a 4x4x4 array to the kernel and get a 4x4x4 array back, computed according to this logic:
if (input[x, y, z] == 1)
    output[x + 1, y, z]++;
if (input[x, y, z] == 2)
    output[x - 1, y, z]++;
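To make the intended logic concrete, here is a minimal CPU-only sketch of it in plain C# (no CUDAfy involved). The class name `CpuReference`, the helper `Flat`, and the decision to skip neighbours that fall outside the cube are my own assumptions; the rule above does not say what should happen at the edges. The flat layout is x-major: x * size * size + y * size + z.

```csharp
using System;

public static class CpuReference
{
    // Flat index into a size x size x size cube, x-major (assumed layout).
    public static int Flat(int x, int y, int z, int size)
    {
        return x * size * size + y * size + z;
    }

    // Applies the rule to every cell and returns a fresh output cube.
    public static int[] Step(int[] input, int size)
    {
        int[] output = new int[size * size * size];
        for (int x = 0; x < size; x++)
        for (int y = 0; y < size; y++)
        for (int z = 0; z < size; z++)
        {
            int v = input[Flat(x, y, z, size)];
            // Assumption: a neighbour outside the cube is simply skipped.
            if (v == 1 && x + 1 < size) output[Flat(x + 1, y, z, size)]++;
            if (v == 2 && x - 1 >= 0) output[Flat(x - 1, y, z, size)]++;
        }
        return output;
    }
}
```

Note that two input cells can increment the same output cell (a 1 at x = 0 and a 2 at x = 2 both hit x = 1), which is why the rule uses ++ rather than a plain assignment.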
I studied the following CUDAfy example:
public const int N = 1 * 1024;
//Omissions
gpu.Launch(128, 1, function, dev_a, dev_b, dev_c);
Kernel:
[Cudafy]
public static void add_0(GThread thread, int[] a, int[] b, int[] c)
{
    int tid = thread.blockIdx.x; // (tid 0 -> 127, from my understanding)
    while (tid < N)
    {
        c[tid] = a[tid] + b[tid];
        tid += thread.gridDim.x;
    }
}
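To check my reading of that kernel, here is a sequential CPU emulation of the launch in plain C#. It assumes 128 blocks of one thread each, so that blockIdx.x runs from 0 to 127 and gridDim.x is 128. The class `GridStrideEmulation` and its method are hypothetical names of mine, not part of CUDAfy.

```csharp
using System;

public static class GridStrideEmulation
{
    public const int N = 1 * 1024;

    // Runs the body of add_0 once per block index, one block after another.
    public static void Add(int gridDimX, int[] a, int[] b, int[] c)
    {
        for (int blockIdxX = 0; blockIdxX < gridDimX; blockIdxX++)
        {
            int tid = blockIdxX;      // stands in for thread.blockIdx.x
            while (tid < N)
            {
                c[tid] = a[tid] + b[tid];
                tid += gridDimX;      // stands in for thread.gridDim.x
            }
        }
    }
}
```

Under that assumption each block handles 1024 / 128 = 8 elements: block b touches b, b + 128, b + 256, and so on.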
I then tried to carry this over to 3D, but I cannot get the indexing right. Say I have the following (the three arrays are there only to test the indexing):
int size = 4;
int[] dev_delta = gpu.Allocate<int>(size * size * size);
int[] dev_space = gpu.Allocate<int>(size * size * size);
int[] dev_result = gpu.Allocate<int>(size * size * size);
gpu.Launch(new dim3(4, 4, 4), 1, "testIndex", dev_delta, dev_space, dev_result);
And the kernel:
[Cudafy]
public static void testIndex(GThread thread, int[] delta, int[] space, int[] result)
{
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;
    int z = thread.blockIdx.z;
    delta[x]++;
    space[y]++;
    result[z]++;
}
Naively I'd expect the following:
delta = {4,4,4,4,0,0,0,0,0, ... 0,0}
space = {4,4,4,4,0,0,0,0,0, ... 0,0}
result = {4,4,4,4,0,0,0,0,0, ... 0,0}
But I get:
delta = {1,1,1,1,0,0,0,0,0, ... 0,0}
space = {1,1,1,1,0,0,0,0,0, ... 0,0}
result = {1,0,0,0,0,0,0,0,0, ... 0,0}
This makes no sense to me; clearly I am missing something.
Questions:
How many threads am I starting?
How do I index my example problem in three dimensions (starting 4x4x4 threads and obtaining x, y and z for flat3DArray[x * sizeY * sizeZ + y * sizeZ + z])?
How do I index my example problem in two dimensions (starting 4x4 threads and letting each thread handle a depth column of length 4)?
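To pin down the two decompositions I am asking about, here they are as plain C# loops on the CPU. This is only a sketch of the index arithmetic, not CUDA or CUDAfy code, and the class `IndexLayouts` and its method names are made up for illustration. In the first version all three loops stand in for a 4x4x4 launch; in the second, only x and y would come from the launch and each work item walks its own z column.

```csharp
using System;

public static class IndexLayouts
{
    // Flat index as written in the question.
    public static int Flat(int x, int y, int z, int sizeY, int sizeZ)
    {
        return x * sizeY * sizeZ + y * sizeZ + z;
    }

    // 3D variant: one work item per (x, y, z) cell.
    public static int[] VisitPerCell(int sizeX, int sizeY, int sizeZ)
    {
        int[] visits = new int[sizeX * sizeY * sizeZ];
        for (int x = 0; x < sizeX; x++)          // would be a launch dimension
        for (int y = 0; y < sizeY; y++)          // would be a launch dimension
        for (int z = 0; z < sizeZ; z++)          // would be a launch dimension
            visits[Flat(x, y, z, sizeY, sizeZ)]++;
        return visits;
    }

    // 2D variant: one work item per (x, y), looping over the depth column.
    public static int[] VisitPerColumn(int sizeX, int sizeY, int sizeZ)
    {
        int[] visits = new int[sizeX * sizeY * sizeZ];
        for (int x = 0; x < sizeX; x++)          // would be a launch dimension
        for (int y = 0; y < sizeY; y++)          // would be a launch dimension
        {
            // This inner loop would live inside the kernel itself.
            for (int z = 0; z < sizeZ; z++)
                visits[Flat(x, y, z, sizeY, sizeZ)]++;
        }
        return visits;
    }
}
```

Either way, every one of the 64 flat indices should be visited exactly once, which is the property I want the real kernel to have.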
I found this question, which may be relevant: "Why is z always zero in CUDA kernel". Even if that is what is tripping me up, I'd still appreciate pure-CUDA answers to sort my brain out.