How do you iterate through a pitched CUDA array?

Question

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.

CUDA by Example is a great start.

The snippet on page 43 shows:

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

In OpenMP the programmer chooses how many times the loop runs, and OpenMP splits the iterations across threads for you. In CUDA you have to launch enough threads (via the number of blocks and threads per block in <<<...>>>) to cover your array, and each thread uses its ID number as the iterator. In other words, you could always launch a kernel with 10,000 threads, and the code above would then work for any array up to N = 10,000 (for smaller arrays the surplus threads are wasted, because they drop out at if (tid < N)).

For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:

// Host code
int width = 64, height = 64; 
float* devPtr; size_t pitch; 
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

MyKernel<<<100, 512>>>(devPtr, pitch, width, height); 

// Device code 
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height) 
{ 
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch); 
        for (int c = 0; c < width; ++c) { 
            float element = row[c]; 
        }
    }
}

This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is launched with 100 blocks of 512 threads. That's fine, because the kernel does nothing other than iterate through the array, but it means each of the 51,200 threads loops through the entire 64 x 64 array.

According to this answer the iterator for when there are blocks of threads going on will be

int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
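
To make the indexing concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) that emulates a 1D launch. The function name emulateAdd1D and its return value are my own for illustration, not part of CUDA. The two loops stand in for blockIdx.x and threadIdx.x, and the block count is rounded up so the last partial block still runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for add<<<numBlocks, threadsPerBlock>>>(a, b, c).
// The outer and inner loops play the roles of blockIdx.x and threadIdx.x;
// on a GPU these iterations would run concurrently, not one after another.
// Returns how many emulated threads failed the bounds check (idle threads).
inline std::size_t emulateAdd1D(const std::vector<int>& a, const std::vector<int>& b,
                                std::vector<int>& c, unsigned threadsPerBlock)
{
    assert(threadsPerBlock > 0 && a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    // Round up so that a partially filled last block is still launched.
    const std::size_t numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    std::size_t idle = 0;
    for (std::size_t blockIdx = 0; blockIdx < numBlocks; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            const std::size_t tid = blockIdx * threadsPerBlock + threadIdx;
            if (tid < n)
                c[tid] = a[tid] + b[tid];   // one element per thread
            else
                ++idle;                     // thread past the end of the array
        }
    }
    return idle;
}
```

With 1,000 elements and 256 threads per block, four blocks are needed and only the threads at the tail of the last block do nothing.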

So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.

So how do I iterate through a pitched array without going through the padding elements?

In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).

Your question isn't clear to me. In the code snippet of the CUDA C Programming Guide you are quoting, you are not going through the padding elements, but you are skipping them. Likewise, if you allocate the arrays involved in the CUDA By Example parallel summation by `cudaMallocPitch`, you have to do the same to skip the padding. I do not see how you could avoid it. — Vitality, Jun 19 '14 at 05:54
If you need to use cuFFT in connection to pitched arrays, you may wish to take a look at [CUFFT : How to calculate the fft when the input is a pitched array](http://stackoverflow.com/questions/20847021/cufft-how-to-calculate-the-fft-when-the-input-is-a-pitched-array). — Vitality, Jun 19 '14 at 05:55
@JackOLantern I see what you mean that the snippet skips the padded elements. Because it's not actually a parallelized loop, it traverses the entire image serially, 51,200 times in parallel (omitting details of how many threads can run at once, etc.). So how do you traverse an image **once** in parallel, skipping the padding? — darda, Jun 20 '14 at 02:18
@JackOLantern Thanks for the link on how to do the 2D FFT. That I also had wrong.... — darda, Jun 20 '14 at 02:34
If you want to form a 2D grid in which each thread accesses a different element of a 2D matrix allocated by `cudaMallocPitch`, then you could take a look at my answer to this post: [Performance of cudaMalloc3D instead of cudaMallocPitch for 2D objects](http://stackoverflow.com/questions/22986777/performance-of-cudamalloc3d-instead-of-cudamallocpitch-for-2d-objects). — Vitality, Jun 20 '14 at 07:14

score 1 · Accepted Answer · edited May 23 '17 at 12:27

After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.

In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.

I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.

The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
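
As a sketch of what "units of bytes" means in practice, the helpers below (hypothetical names, plain C++) apply the pitch to a char pointer before casting back to the element type, which is the same arithmetic the Programming Guide example uses. The 512-byte pitch in the usage note is only an assumed value; the real one is whatever cudaMallocPitch returns.

```cpp
#include <cassert>
#include <cstddef>

// Returns a pointer to element (row, col) of a pitched 2D array.
// pitchBytes is the row stride in BYTES, as reported by cudaMallocPitch,
// so the row offset is added to a char* before casting back to T*.
template <typename T>
T* pitchedElement(void* base, std::size_t pitchBytes, std::size_t row, std::size_t col)
{
    assert(base != nullptr);
    char* rowStart = static_cast<char*>(base) + row * pitchBytes;
    return reinterpret_cast<T*>(rowStart) + col;
}

// Row stride expressed in elements instead of bytes. Only meaningful when
// the pitch is a whole multiple of the element size.
template <typename T>
std::size_t pitchInElements(std::size_t pitchBytes)
{
    assert(pitchBytes % sizeof(T) == 0);
    return pitchBytes / sizeof(T);
}
```

For 4-byte floats and an assumed pitch of 512 bytes, each row is 128 elements apart, so element (2, 5) sits at flat index 2 * 128 + 5.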

So given an image of dimensions width x height in pixels, this is how it comes together:

Allocating memory

I'm using pinned memory on the host side because it's supposed to make transfers faster. That's allocated with cudaHostAlloc, which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it can differ between allocations. In my case the dimensions are all the same (complex to complex transform), but I also have arrays of real numbers, so I store a complexPitch and a realPitch. Pitched memory is allocated like this:

cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);

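To copy to or from a pitched array, use cudaMemcpy2D rather than a plain cudaMemcpy, because the rows of the source and destination are spaced differently:

cudaMemcpy2D(inputGPU, complexPitch,        // destination and destination pitch
    inputPinned, width * sizeof(CFPtype),   // source and source pitch (= row width in bytes, since the host array is not padded)
    width * sizeof(CFPtype), height,        // bytes to copy per row, number of rows
    cudaMemcpyKind::cudaMemcpyHostToDevice);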
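
To see why the two pitches matter, here is a host-only sketch of the row-by-row copy that a pitched 2D copy performs. copy2D is a hypothetical helper written for illustration, not a CUDA function, and it makes no claim about how the runtime actually implements the transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side sketch of a pitched 2D copy: copy widthBytes from each of
// `height` rows, stepping the destination by dstPitch bytes and the source
// by srcPitch bytes. Bytes beyond widthBytes in a row (the padding) are
// never read or written.
inline void copy2D(void* dst, std::size_t dstPitch,
                   const void* src, std::size_t srcPitch,
                   std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
}
```

The argument order mirrors the cudaMemcpy2D call above: a tightly packed source has a pitch equal to the row width in bytes, while the destination uses the larger padded pitch.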

FFT plan for pitched arrays

JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:

int n[] = {height, width};
int nembed[] = {height, (int)(complexPitch / sizeof(CFPtype))}; // row length in elements, including padding
result = cufftPlanMany(
    &plan, 
    2, n, //transform rank and dimensions
    nembed, 1, //input array physical dimensions and stride
    1, //input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1, //output array physical dimensions and stride
    1, //output distance to next batch
    cufftType::CUFFT_C2C, 1);
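
The reason nembed carries the pitch divided by the element size is easier to see from the indexing rule. The sketch below writes out the 2D advanced-layout formula as I understand it from the cuFFT documentation; the function name is my own, and x is the slow (row) index while y is the fast (column) index.

```cpp
#include <cassert>
#include <cstddef>

// Flat element index of 2D point (x, y) in batch b under the advanced data
// layout used by cufftPlanMany:
//     b * dist + (x * nembed1 + y) * stride
// nembed1 is the second entry of the nembed array, i.e. the physical row
// length in elements (padding included).
inline std::size_t advancedLayoutIndex2D(std::size_t b, std::size_t x, std::size_t y,
                                         std::size_t nembed1, std::size_t stride,
                                         std::size_t dist)
{
    assert(stride > 0);
    return b * dist + (x * nembed1 + y) * stride;
}
```

With an assumed 512-byte pitch and 8-byte complex elements, nembed1 is 64, and element (3, 5) lands at index 197, which is exactly 3 pitches plus 5 elements into the buffer.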

Executing the FFT is trivial:

cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);

So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.

void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));

    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}

Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. You need enough blocks * threads to cover the array, and in the kernel you must check that the thread's indices do not exceed the array size. With 2D values for threadsPerBlock and numBlocks, each thread maps to a (column, row) pair rather than to a byte offset, so no thread is ever assigned to the padding elements in the array.
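
The block count can also be computed with integer arithmetic, which gives the same result as the ceil-based expression in GPUCalcMagPhase. The sketch below is plain C++ with hypothetical names (Dim2 stands in for dim3). It also includes a check on the block size, since a launch fails when threads per block exceed the device limit. That limit is reported in the maxThreadsPerBlock field of cudaGetDeviceProperties and is 1024 on devices of compute capability 2.0 and later, which may be why an overly large block size never reaches the kernel.

```cpp
#include <cassert>

// Plain stand-in for CUDA's dim3 so the sketch compiles without the toolkit.
struct Dim2 { unsigned x, y; };

// Integer ceiling division: the smallest q with q * b >= a.
inline unsigned ceilDiv(unsigned a, unsigned b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of blocks needed to cover a width x height image with square
// blocks of blockSize x blockSize threads.
inline Dim2 blocksFor(unsigned width, unsigned height, unsigned blockSize)
{
    return Dim2{ ceilDiv(width, blockSize), ceilDiv(height, blockSize) };
}

// True when a blockSize x blockSize block stays within the per-block thread
// limit. The limit is passed in by the caller; on a real device it would
// come from the device properties.
inline bool blockFits(unsigned blockSize, unsigned maxThreadsPerBlock)
{
    return blockSize > 0 && blockSize * blockSize <= maxThreadsPerBlock;
}
```

For a 1000 x 600 image and 16 x 16 blocks this yields a 63 x 38 grid. A 32 x 32 block is exactly 1024 threads, while 64 x 64 would be 4096 and would not launch under a 1024-thread limit.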

Traversing a pitched array in parallel

The kernel uses the standard pointer arithmetic from the documentation:

__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width) 
        return;

    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;

    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];

    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);

    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);

    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}

The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.
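
A rough way to check that claim without a GPU is to emulate the launch on the host. In the sketch below, the four loops stand in for blockIdx and threadIdx in y and x, Complex is a hypothetical stand-in for cufftComplex, and the function returns the number of emulated threads rejected by the bounds checks. It is only meant to mirror the kernel's indexing, not its performance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Complex { float x, y; };  // hypothetical stand-in for cufftComplex

// Host-side emulation of CalcMagPhaseKernel launched on a 2D grid of
// blockSize x blockSize blocks. Pitches are in bytes, as in the kernel.
// Returns how many emulated threads fell outside the image.
inline std::size_t emulateMagPhase(const Complex* data, std::size_t dataPitch,
                                   int width, int height,
                                   float* magnitude, float* phase,
                                   std::size_t magPhasePitch, int blockSize)
{
    assert(blockSize > 0 && width > 0 && height > 0);
    const int blocksX = (width + blockSize - 1) / blockSize;
    const int blocksY = (height + blockSize - 1) / blockSize;
    std::size_t idle = 0;

    for (int by = 0; by < blocksY; ++by)
    for (int bx = 0; bx < blocksX; ++bx)
    for (int ty = 0; ty < blockSize; ++ty)
    for (int tx = 0; tx < blockSize; ++tx) {
        const int col = tx + blockSize * bx;   // threadX in the kernel
        const int row = ty + blockSize * by;   // threadY in the kernel
        if (col >= width || row >= height) { ++idle; continue; }

        // Step to the start of the row in bytes, then index by column.
        const Complex* inRow = reinterpret_cast<const Complex*>(
            reinterpret_cast<const char*>(data) + row * dataPitch);
        float* magRow = reinterpret_cast<float*>(
            reinterpret_cast<char*>(magnitude) + row * magPhasePitch);
        float* phaseRow = reinterpret_cast<float*>(
            reinterpret_cast<char*>(phase) + row * magPhasePitch);

        const Complex z = inRow[col];
        magRow[col] = std::sqrt(z.x * z.x + z.y * z.y);
        phaseRow[col] = std::atan2(z.y, z.x);
    }
    return idle;
}
```

For a 5 x 3 image with 4 x 4 blocks, the grid is 2 x 1 blocks, so 32 threads are launched and 17 of them are idle. The padding columns in the output buffers are never written, whatever the pitch is.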

[Here](http://stackoverflow.com/a/4394965/149506) is a good resource to start learning about optimizing threads per block. — darda, Jun 20 '14 at 16:50
