
This is my first attempt at a CUDA program. Here's what it's supposed to do:

  1. Receive 1D Pixel array from host memory
  2. Each Pixel is processed by one thread: it is thread-safe because only "val" is read and only "newval" is written. Wait for all threads to sync.
  3. Each Pixel is processed by one thread: copy "newval" to "val."
  4. Write this array back to host memory.
  5. Repeat 2-4 for several different frames.

What happens, however, is that only a couple of values, out of roughly 32,000, in the output arrays look sane; the rest are zero.

I've removed the calculations for brevity.

__global__ void kernel(Pixel *array, float dt)
{
    const unsigned int tid = threadIdx.x;
    Pixel *point = array + tid;
    //DO A BUNCH OF CALCULATIONS ON PIXEL KIND OF LIKE THIS
    point->newval = point->val + foo;
}

__global__ void copykernel(Pixel *array)
{
    const unsigned int tid = threadIdx.x;
    Pixel *point = array + tid;
    //COPY THE NEWVALS OVER TO THE OLD VALS IN PREPARATION FOR THE NEXT FRAME
    point->val = point->newval;
}

extern "C" bool runIt(const int argc, const char **argv, Pixel *inarray, Pixel **outarrays, int arraysize, int numframes, float dt)
{
    int memsize = arraysize*sizeof(Pixel);
    int i=0;

    Pixel *array;
    cudaMalloc((void **) &array, memsize);
    cudaMemcpy(array, inarray, memsize, cudaMemcpyHostToDevice);

    int numthreads = arraysize;
    dim3 grid(1,1,1);
    dim3 threads(numthreads,1,1);

    for(i=0;i<numframes;i++)
    {
        kernel<<<grid, threads>>>((Pixel *) array, dt);
        cudaThreadSynchronize();
        copykernel<<<grid, threads>>>((Pixel *) array);
        cudaThreadSynchronize();
        cudaMemcpy(array, outarrays[i], memsize, cudaMemcpyDeviceToHost);
    }
    cudaFree(array);
    return true;
}

I have a suspicion that I'm setting up the parameters for the device incorrectly, or else I'm getting one of the device-specific keywords wrong or forgetting a crucial step. Does anything jump out at you?

Frank Harris

1 Answer


I don't think you can launch that many threads in a single block, and even if you can, it's not a good idea. Try setting the number of threads per block to 256 (16x16 for 2D), then choosing the grid size based on your input size.

dim3 threads(256,1,1);
dim3 grid((arraysize + threads.x - 1)/threads.x, 1, 1); // round up so a partial block covers the tail

With more than one block, each kernel also needs a global index, const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;, plus an if (tid < arraysize) guard before touching the array.

Also, your device-to-host copy has its arguments reversed; cudaMemcpy takes the destination first, so you need to swap array and outarrays[i]:

cudaMemcpy(outarrays[i], array, memsize, cudaMemcpyDeviceToHost);
Constantin
  • D'oh. I should've caught that faulty cudaMemcpy...And of course, that was the problem. Thanks. Now, about the threads...It didn't give me a problem with that number; can you explain why it's a problem? Finally, I need to wait for all threads to finish before running the second kernel. Can I just combine them, then, and add a cudaThreadSynchronize in between? – Frank Harris May 09 '13 at 01:33
  • Frank you should definitely fix the threads as well, let me see if I can find the article for you.... – Constantin May 09 '13 at 01:34
  • http://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid – Constantin May 09 '13 at 01:35
  • For your double kernel, I guess if you're doing something where multiple blocks might need the same old val then you need two kernels. But if a pixel is only touched ONCE in your entire algorithm, there's no need for the overhead of two launches. An example of when this is false is an image warp algorithm, which may sample a source pixel multiple times. – Constantin May 09 '13 at 01:37
  • Right...That wouldn't show up since I removed a big chunk of the code, but the pixels reference the values of adjacent pixels (it's a finite difference atmospheric model). However, would it work if I put the code from the second kernel into the first, and added a cudaThreadSynchronize in the middle? – Frank Harris May 09 '13 at 01:40
  • No, you might get errors on your boundaries between blocks (earlier link) as syncs in kernels are only within a block. In your instance what you're doing is necessary, and overhead of calling a new kernel is very minimal. – Constantin May 09 '13 at 01:42
  • Alright...I'll figure out this thread issue as well, but for now it seems that my program is at least "correct." Thanks for your help, friend! – Frank Harris May 09 '13 at 01:51
  • Yes, your program will run correctly, but whether it runs efficiently I'm unsure. When you get a moment, take some time to understand how NVidia's streaming multiprocessors work; keywords to look for are blocks, grids, warps, and SM. Anyways, happy CUDAing! – Constantin May 09 '13 at 01:55