
I'm having some problems understanding how to loop over 3 dimensional arrays with a kernel.

This is the code I have so far:

#include <iostream>
#include <cstdlib>   // rand, srand, system
#include <ctime>

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

using namespace std;


int main()
{
    // Array properties
    const int width = 1;
    const int height = 1;
    const int depth = 1;

    // Declaration of host arrays
    float h_A[width][height][depth];
    float h_B[width][height][depth];
    float h_C[width][height][depth] = {{{0}}};

    // Fill up the input arrays with random values
    srand((unsigned)time(0));
    for(int i = 0; i < width; i++){
        for(int j = 0; j < height; j++){
            for(int z = 0; z < depth; z++){
                h_A[i][j][z] = rand()%1000;
                h_B[i][j][z] = rand()%1000;
            }
        }
    }

    // Declaration of device pitched pointers
    cudaPitchedPtr d_A, d_B, d_C;

    // Allocating pitched 3D memory on the GPU.
    // Note: with this extent, width is the fastest-varying (row) dimension.
    cudaExtent extent = make_cudaExtent(width*sizeof(float), height, depth);
    cudaMalloc3D(&d_A, extent);
    cudaMalloc3D(&d_B, extent);
    cudaMalloc3D(&d_C, extent);

    // Copying memory from host to device. The params struct must be
    // zero-initialized, and each copy needs its own source pointer.
    // make_cudaPitchedPtr takes (ptr, pitch, width in elements, height).
    cudaMemcpy3DParms p = {0};
    p.extent = extent;
    p.kind = cudaMemcpyHostToDevice;

    p.srcPtr = make_cudaPitchedPtr(h_A, width*sizeof(float), width, height);
    p.dstPtr = d_A;
    cudaMemcpy3D(&p);
    p.srcPtr = make_cudaPitchedPtr(h_B, width*sizeof(float), width, height);
    p.dstPtr = d_B;
    cudaMemcpy3D(&p);
    p.srcPtr = make_cudaPitchedPtr(h_C, width*sizeof(float), width, height);
    p.dstPtr = d_C;
    cudaMemcpy3D(&p);

    system("pause");
    return 0;
}

How do I make a kernel that loops over each element in the arrays and adds them together?

  • Do not make a one-to-one translation from C/C++ to CUDA for this kind of operation. Have a look at the CUDA SDK reduction example. Think parallel :-) – Vitality Sep 15 '13 at 19:40
  • Two things: if all you want is the sum, you should use http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-asum simply because it's optimized like crazy. Actually, if all you want is the sum, then I would guess that a CPU-based BLAS library would be quicker. I believe the conventional wisdom is that BLAS Level 1 functions are memory bound rather than compute bound, so keeping this kind of thing on the CPU would make sense – Martin Kristiansen Sep 15 '13 at 20:11
  • A full working example can be now found in the answer to [Copying from cuda 3D memory to linear memory: copied data is not where I expected](http://stackoverflow.com/questions/16107480/copying-from-cuda-3d-memory-to-linear-memory-copied-data-is-not-where-i-expecte/23052768#23052768). – Vitality Apr 14 '14 at 05:46

1 Answer


There is an example on page 21 of the CUDA 4.0 programming guide for looping over a 2D array of floats:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);


// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        // Each row starts at a pitch-aligned byte offset from the base pointer
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}

Rewriting it to sum up the elements should be easy. Additionally, you can refer to this thread. When efficiency is a concern, you might also look at the parallel reduction approach in CUDA. This is used, for example, when implementing Monte Carlo simulations (see the Multi Monte Carlo example).
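Applied to the pitched 3D allocations from the question, a minimal sketch of an element-wise add could look like the following. It assumes the extent was built as width*sizeof(float) × height × depth, so each slice is height rows of pitch bytes and each row holds width floats; the kernel name add3D, the single-thread <<<1,1>>> launch, and the copy-back struct q are illustrative only (a serial loop on one thread, just to show the pitched addressing; a parallel version would map x/y/z to thread indices instead of looping):

// Device code: element-wise C = A + B over pitched 3D allocations.
// A slice is height rows of pitch bytes; a row holds width floats.
__global__ void add3D(cudaPitchedPtr A, cudaPitchedPtr B, cudaPitchedPtr C,
                      int width, int height, int depth)
{
    for (int z = 0; z < depth; ++z) {
        char* sliceA = (char*)A.ptr + z * A.pitch * height;
        char* sliceB = (char*)B.ptr + z * B.pitch * height;
        char* sliceC = (char*)C.ptr + z * C.pitch * height;
        for (int y = 0; y < height; ++y) {
            float* rowA = (float*)(sliceA + y * A.pitch);
            float* rowB = (float*)(sliceB + y * B.pitch);
            float* rowC = (float*)(sliceC + y * C.pitch);
            for (int x = 0; x < width; ++x)
                rowC[x] = rowA[x] + rowB[x];
        }
    }
}

// Host code: launch, then copy the result back device-to-host
add3D<<<1, 1>>>(d_A, d_B, d_C, width, height, depth);

cudaMemcpy3DParms q = {0};
q.srcPtr = d_C;
q.dstPtr = make_cudaPitchedPtr(h_C, width*sizeof(float), width, height);
q.extent = extent;
q.kind = cudaMemcpyDeviceToHost;
cudaMemcpy3D(&q);

Passing the cudaPitchedPtr structs to the kernel by value works because they are plain structs; the kernel must use each allocation's own pitch, since cudaMalloc3D may pad rows differently than the logical width.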

  • Efficiently summing up elements in CUDA is a rather difficult task. See [reduction](http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf) for more details. – stuhlo Sep 15 '13 at 19:39
  • Efficiency is a different topic; the question is about the basic idea of how to loop over an array – 4pie0 Sep 15 '13 at 19:45