
I have a 3D image with dimensions 512*512*512 and I have to process all the voxels individually. However, I can't get the launch dimensions right so that the x, y and z values address every voxel.

In my kernel I have:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

I am running the program by using:

Kernel<<<dim3(8,8), dim3(8,8,16)>>>();

I chose those because 64 blocks of 1024 threads each should cover every pixel. However, how do I get the coordinate values when I have those dimensions?

When calling the kernel I have to set the launch dimensions so that the x, y and z values actually run from 0 to 511 (this then gives me the position of every pixel). But with every combination I try, the kernel either does not run at all, or it runs but the values don't get high enough.

The program should make it so that every thread gets a pixel with (x, y, z) coordinates corresponding to that pixel in the image. As the simplest test, I am just trying to print the coordinates to see whether all of them are printed.

Any help?

EDIT:

The properties of my GPU:

Compute capability: 2.0
Name: GeForce GTX 480

The code of my program, just to test things out:

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    printf("Coords: (%i, %i, %i)\n", x, y, z);
}

// Host code
int main(int argc, char** argv) {

    Kernel<<<dim3(8, 8), dim3(8,8,16)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();

    return 0;
}
    Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Robert Crovella Oct 27 '14 at 17:25
  • I just edited it so that it contains the desired behavior. – Mourad el Maouchi Oct 27 '14 at 17:30
  • Threadblock dimensions in Z are limited to 64. This doesn't necessarily explain your issue, but it might. It's impossible to tell since you've provided almost no code. If you provide a complete [MCVE](http://stackoverflow.com/help/mcve), then I'm sure someone can help spot what the actual issue is. There's nothing wrong with the indexing code you have shown. – Robert Crovella Oct 27 '14 at 17:31
  • I just added the code I am using right now just to debug and see what happens. – Mourad el Maouchi Oct 27 '14 at 17:38

1 Answer


To cover a 512x512x512 space with the indexing you have shown (i.e. one thread per voxel) you would need a kernel launch something like this:

Kernel<<<dim3(64,64,64), dim3(8,8,8)>>>();

When I multiply the corresponding grid and block dimensions in any direction:

64*8

I get 512. This gives a grid of 512 threads in each of the 3 dimensions. Your indexing will work with this arrangement as-is to produce one unique thread per voxel.
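
In a real application each thread would typically also convert its (x, y, z) coordinate into a linear offset into the volume, with a bounds check. The following is only a minimal sketch of that idea; the `ProcessVolume` name, the `d_volume` pointer and the `DIM` constant are illustrative assumptions, not part of your code:

#define DIM 512

// One thread per voxel; the bounds check keeps the kernel safe if the
// grid happens to be launched larger than the volume.
__global__ void ProcessVolume(unsigned char *d_volume)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    if (x < DIM && y < DIM && z < DIM) {
        size_t idx = (size_t)z * DIM * DIM + (size_t)y * DIM + x;
        d_volume[idx] += 1;   // placeholder for the per-voxel work
    }
}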

The above assumes a cc2.0 or higher device (your mention of 1024 threads per block suggests that you have a cc2.0+ device), which permits 3D grids. If you have a cc1.x device, you will need to modify your indexing.

In that case, you might want something like this:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = (blockIdx.y%64) * blockDim.y + threadIdx.y;
int z = (blockIdx.y/64) * blockDim.z + threadIdx.z;

along with a kernel launch like this:

Kernel<<<dim3(64,4096), dim3(8,8,8)>>>();
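
The 4096 blocks in the grid's y dimension encode both the y and z block indices (4096 = 64 * 64), which the modulo and division above unpack. Put together, the cc1.x variant could look like the sketch below; as before, the `d_volume` argument is just an assumed example, not part of your code:

// cc1.x sketch: 2D grid, with grid.y folding together 64 y-blocks and 64 z-blocks
__global__ void ProcessVolumeCC1x(unsigned char *d_volume)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = (blockIdx.y % 64) * blockDim.y + threadIdx.y;
    int z = (blockIdx.y / 64) * blockDim.z + threadIdx.z;

    if (x < 512 && y < 512 && z < 512) {
        size_t idx = (size_t)z * 512 * 512 + (size_t)y * 512 + x;
        d_volume[idx] += 1;   // placeholder for the per-voxel work
    }
}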

Here's a fully worked example (cc2.0), based on the code you have now shown:

$ cat t604.cu
#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    if ((x==511)&&(y==511)&&(z==511)) printf("Coords: (%i, %i, %i)\n", x, y, z);
}

// Host code
int main(int argc, char** argv) {
    cudaFree(0);
    cudaCheckErrors("CUDA is not working correctly");
    Kernel<<<dim3(64, 64, 64), dim3(8,8,8)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");

    return 0;
}
$ nvcc -arch=sm_20 -o t604 t604.cu
$ cuda-memcheck ./t604
========= CUDA-MEMCHECK
Coords: (511, 511, 511)
========= ERROR SUMMARY: 0 errors
$

Note that I have elected to print only one line. I did not want to wade through 512x512x512 lines of printout; it would take a very long time to run, and in-kernel printf is limited in output volume anyway.
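
If the goal is simply to verify that every voxel gets visited exactly once, an alternative to printing is to have each in-range thread bump an atomic counter and compare the total against 512*512*512 on the host. This is a sketch of that idea under my own assumptions (the `CountVoxels` kernel and the `d_count` buffer are not part of the code above):

__global__ void CountVoxels(unsigned int *count)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // count every thread whose coordinates fall inside the 512^3 volume
    if (x < 512 && y < 512 && z < 512)
        atomicAdd(count, 1u);
}

// host side
unsigned int h_count = 0, *d_count;
cudaMalloc(&d_count, sizeof(unsigned int));
cudaMemset(d_count, 0, sizeof(unsigned int));
CountVoxels<<<dim3(64,64,64), dim3(8,8,8)>>>(d_count);
cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
printf("visited: %u, expected: %u\n", h_count, 512u*512u*512u);

The expected count is 134217728, which still fits comfortably in an unsigned int.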

Robert Crovella
  • When I try that solution of yours in my kernel function, it does not print anything at all. It seems like it's frozen or can't handle the dimensions; it just does not work for me. – Mourad el Maouchi Oct 27 '14 at 17:40
  • Add proper [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code. Are you compiling for a cc2.0 device? – Robert Crovella Oct 27 '14 at 17:48
  • Also, with a grid of this size (512x512x512 threads), you are likely to run into a windows TDR event if you are running on windows. Proper cuda error checking will give an indication of that. – Robert Crovella Oct 27 '14 at 18:02
  • I have tried both ways, with (64,64,64) and with (64,4096), but neither prints anything, unless it is supposed to run for more than 2 mins... Even using 256 instead of 4096 it doesn't seem to run/finish. One more detail: I run it over SSH on a server, from Windows. Could that be why it doesn't run? – Mourad el Maouchi Oct 27 '14 at 18:16
  • Can you please run my exact code instead? And run it with `cuda-memcheck` just as I have. – Robert Crovella Oct 27 '14 at 18:28
  • With your exact code it keeps hanging with this: mourade@gpu05:~$ nvcc t604.cu -lcudart -lm -arch=sm_20 -o t604 mourade@gpu05:~$ cuda-memcheck ./t604 ========= CUDA-MEMCHECK – Mourad el Maouchi Oct 27 '14 at 18:34
  • I suspect a machine configuration problem. I've updated my code to include proper cuda error checking. Can you try that instead? Since I have added error checking to the code, you can run it without `cuda-memcheck` – Robert Crovella Oct 27 '14 at 18:48