After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc, which is straightforward. For pitched memory, you need to store the pitch for each different width and type, because it could change. In my case the dimensions are all the same (complex to complex transform), but I also have arrays of real numbers, so I store a complexPitch and a realPitch. Pitched memory is allocated like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to/from pitched arrays you cannot use cudaMemcpy; use cudaMemcpy2D instead:
cudaMemcpy2D(inputGPU, complexPitch,                 //destination and destination pitch
             inputPinned, width * sizeof(CFPtype),   //source and source pitch (= width because it's not padded).
             width * sizeof(CFPtype), height, cudaMemcpyKind::cudaMemcpyHostToDevice);
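To make the byte-based arguments concrete, here is a minimal host-only sketch of the row-by-row copy that a pitched 2D copy amounts to. It is an illustration, not the CUDA implementation: copyPitched is a hypothetical helper, and it runs on ordinary host memory so it can be tried without a GPU. The point is that both pitches and the width are byte counts, and only the height is a row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the CUDA API): host-only illustration of a
// pitched 2D copy. Pitches and widthBytes are in BYTES; height is in rows.
void copyPitched(void *dst, std::size_t dstPitch,
                 const void *src, std::size_t srcPitch,
                 std::size_t widthBytes, std::size_t height)
{
    // A row's payload can never be wider than the pitch of either buffer.
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);

    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // Copy only the payload of each row; the padding bytes between
        // widthBytes and the pitch are left untouched.
        std::memcpy(d + row * dstPitch, s + row * srcPitch, widthBytes);
    }
}
```

Passing width instead of width * sizeof(element) here would copy only a fraction of each row, which is the same mistake described above.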
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
//physical row length in elements, not bytes; the cast avoids a narrowing conversion from size_t
int nembed[] = {height, (int)(complexPitch / sizeof(CFPtype))};
result = cufftPlanMany(
    &plan,
    2, n,                       //transform rank and dimensions
    nembed, 1,                  //input array physical dimensions and stride
    1,                          //input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1,                  //output array physical dimensions and stride
    1,                          //output distance to next batch
    cufftType::CUFFT_C2C, 1);
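The plan describes the padded layout in elements, while the pitch comes back in bytes, so the division above is only meaningful if the pitch is a whole multiple of the element size. A minimal sketch of a guard for that assumption follows; pitchInElements is a hypothetical helper name, not a cuFFT or CUDA function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a pitch in bytes into a row length in
// elements, which is the unit the nembed array uses.
// Returns -1 when the pitch is not a whole number of elements, because the
// padded layout then cannot be expressed in element units.
int pitchInElements(std::size_t pitchBytes, std::size_t elementSize)
{
    assert(elementSize > 0);
    if (pitchBytes % elementSize != 0)
        return -1;
    return static_cast<int>(pitchBytes / elementSize);
}
```

For example, a 512-byte pitch with 8-byte single-precision complex elements gives a physical row length of 64 elements, even if the image itself is narrower.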
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));
    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
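The "enough blocks to cover the array" arithmetic can also be done with integer ceiling division instead of ceil on doubles. The sketch below is host-only and uses hypothetical helper names (blocksToCover, idleThreads); it also counts how many launched threads fall outside the image and therefore exit at the bounds check, assuming a square block of blockSize x blockSize threads as in the launcher above.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks of blockSize threads that
// covers n elements along one dimension (integer ceiling division).
unsigned int blocksToCover(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return static_cast<unsigned int>((n + blockSize - 1) / blockSize);
}

// Hypothetical helper: number of launched threads that land outside a
// width x height image when using square blocks of blockSize x blockSize.
long long idleThreads(int width, int height, int blockSize)
{
    long long coveredX = static_cast<long long>(blocksToCover(width, blockSize)) * blockSize;
    long long coveredY = static_cast<long long>(blocksToCover(height, blockSize)) * blockSize;
    return coveredX * coveredY - static_cast<long long>(width) * height;
}
```

With a 1000 x 1000 image and 16 x 16 blocks this gives 63 blocks per dimension, so the grid covers 1008 x 1008 threads and the extra ones return early.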
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;
    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;
    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];
    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);
    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);
    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.
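For checking the kernel's output, a host-side reference that uses the same byte-based row addressing can be handy. The following is a minimal sketch under a few assumptions: Complex2f is a stand-in for the single-precision complex type (two floats named x and y, real part first), calcMagPhaseHost is a hypothetical name, and the buffers are ordinary host arrays laid out with the given pitches.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Stand-in for the single-precision complex type (assumption: x = real, y = imaginary).
struct Complex2f { float x; float y; };

// Hypothetical host-side reference: same row addressing as the kernel, but
// with two ordinary loops in place of one thread per pixel.
void calcMagPhaseHost(const Complex2f *data, std::size_t dataPitch,
                      int width, int height,
                      float *magnitude, float *phase, std::size_t magPhasePitch)
{
    for (int y = 0; y < height; ++y) {
        // Pitches are in bytes, so rows are located through byte pointers.
        const Complex2f *row = reinterpret_cast<const Complex2f *>(
            reinterpret_cast<const char *>(data) + y * dataPitch);
        float *magRow = reinterpret_cast<float *>(
            reinterpret_cast<char *>(magnitude) + y * magPhasePitch);
        float *phaseRow = reinterpret_cast<float *>(
            reinterpret_cast<char *>(phase) + y * magPhasePitch);

        // Only the first `width` elements of each row are real data;
        // the padding up to the pitch is never read or written.
        for (int x = 0; x < width; ++x) {
            magRow[x] = std::sqrt(row[x].x * row[x].x + row[x].y * row[x].y);
            phaseRow[x] = std::atan2(row[x].y, row[x].x);
        }
    }
}
```

Copying the GPU results back into host buffers and comparing them element by element against this reference, within a small tolerance, is one way to confirm that the pitch handling in the kernel is right.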