I want to subdivide an image, of size [32,32] for example, into smaller tiles (e.g. [8,8]) and perform a batched 2D FFT on all of the tiles. Is this possible with cuFFT, perhaps using cufftPlanMany() and some combination of the istride, idist, and inembed parameters?
The way I see it, I would need to reshape my input image to a size of [8,4,8,4], and then permute the middle two indices for a final shape of [8,8,4*4], after which I could run the standard batched 2D FFT. I could do this with a custom CUDA kernel that copies the data into the new layout, but I was wondering whether cuFFT already has this functionality.
Here is my brute force kernel that transforms a [32,32] image into batched [8,8,4*4=16] format:
// Element type is assumed here; substitute cufftComplex or similar as needed.
// Launch with blockDim = tile size and gridDim = number of tiles per axis.
__global__ void tileImage(const float *input, float *output)
{
    // Output tile coordinates
    int x = threadIdx.x;
    int y = threadIdx.y;
    // Output tile size, e.g. [8,8]
    int KX = blockDim.x;
    int KY = blockDim.y;
    // Full dataset size, e.g. [32,32]
    int NX = blockDim.x * gridDim.x;
    int NY = blockDim.y * gridDim.y;
    // Batch ID
    int batch = blockIdx.x + blockIdx.y * gridDim.x;
    // Input data coordinates
    int nx = threadIdx.x + blockIdx.x * blockDim.x;
    int ny = threadIdx.y + blockIdx.y * blockDim.y;
    // Copy to a batched format, e.g. [8,8,4*4=16]
    output[x + y*KX + batch*KX*KY] = input[nx + ny*NX];
}
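For checking the kernel's output, the same copy can be written as a plain host loop. This is only a sketch: `tile_image_host` is a hypothetical name, `float` is an assumed element type, and the image is assumed to be row-major with x as the fastest index and dimensions that are exact multiples of the tile size.

```c
/* Host-side reference for the tiling copy: gathers an NX-by-NY image
   (x fastest) into (NX/KX)*(NY/KY) contiguous KX-by-KY tiles.
   Hypothetical helper; float is an assumed element type. */
void tile_image_host(const float *input, float *output,
                     int NX, int NY, int KX, int KY)
{
    int GX = NX / KX;   /* tiles per row, plays the role of gridDim.x */
    int GY = NY / KY;   /* tiles per column, plays the role of gridDim.y */
    for (int by = 0; by < GY; ++by) {
        for (int bx = 0; bx < GX; ++bx) {
            int batch = bx + by * GX;          /* same batch ID as the kernel */
            for (int y = 0; y < KY; ++y) {
                for (int x = 0; x < KX; ++x) {
                    int nx = x + bx * KX;      /* input column */
                    int ny = y + by * KY;      /* input row */
                    output[x + y * KX + batch * KX * KY] = input[nx + ny * NX];
                }
            }
        }
    }
}
```

Comparing this against the GPU result on a ramp image (input[i] = i) makes indexing mistakes easy to spot.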
On the other hand, cuFFT provides Advanced Data Layout in the form of
input[ batch * idist + (y * inembed[1] + x) * istride ]
output[ batch * odist + (y * onembed[1] + x) * ostride ]
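To make the formula concrete, here is a small sketch that evaluates it on the host. `adl_index` is a hypothetical helper, not a cuFFT function, and it follows the convention used in this question, where x is the fastest-varying index.

```c
/* Element offset of (batch, y, x) under the layout formula quoted above,
   with x as the fastest-varying index. Hypothetical helper, not part of cuFFT. */
long adl_index(int batch, int y, int x, int dist, int embed1, int stride)
{
    return (long)batch * dist + ((long)y * embed1 + x) * stride;
}
```

With this helper, the packed output of my kernel corresponds to odist = KX*KY, onembed[1] = KX, and ostride = 1, since adl_index(batch, y, x, KX*KY, KX, 1) equals x + y*KX + batch*KX*KY.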
Let me try and map it to my kernel above. The best I can do is
output[x + y*KX + batch*KX*KY] =
input[x + y*NX + batch*KX + blockIdx.y*(KY-1)*NX]
The problem is that I can't map the batch index without leaving an extra term in either blockIdx.x or blockIdx.y. The batch index is really two-dimensional (tile column and tile row), while cuFFT offers only a single batch stride, idist. I think it boils down to the fact that a permutation is required to turn the image into the batched format, and that cannot be achieved with just the linear transformation (strides and offsets) available in cuFFT.
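The index arithmetic can be checked on the host. The sketch below uses hypothetical helper names and assumes istride = 1, inembed[1] = NX, and idist = KX. It counts how many elements of the image the strided formula addresses differently from my kernel, first for a single plan covering all tiles, then for one row of tiles at a time with the input pointer shifted by by*KY*NX elements.

```c
/* Mismatches when one plan covers every tile with batch = bx + by*GX. */
int mismatches_single_plan(int NX, int NY, int KX, int KY)
{
    int GX = NX / KX, GY = NY / KY, bad = 0;
    for (int by = 0; by < GY; ++by)
        for (int bx = 0; bx < GX; ++bx)
            for (int y = 0; y < KY; ++y)
                for (int x = 0; x < KX; ++x) {
                    long kernel_idx = (x + bx * KX) + (long)(y + by * KY) * NX;
                    long layout_idx = (long)(bx + by * GX) * KX + (long)y * NX + x;
                    if (kernel_idx != layout_idx) ++bad;
                }
    return bad;
}

/* Mismatches for tile row `by` alone, with batch = bx and an assumed
   base offset of by*KY*NX elements added to the input pointer. */
int mismatches_per_tile_row(int NX, int KX, int KY, int by)
{
    int GX = NX / KX, bad = 0;
    long base = (long)by * KY * NX;
    for (int bx = 0; bx < GX; ++bx)
        for (int y = 0; y < KY; ++y)
            for (int x = 0; x < KX; ++x) {
                long kernel_idx = (x + bx * KX) + (long)(y + by * KY) * NX;
                long layout_idx = base + (long)bx * KX + (long)y * NX + x;
                if (kernel_idx != layout_idx) ++bad;
            }
    return bad;
}
```

For the [32,32] image with [8,8] tiles, the single-plan version disagrees for every tile below the first row, while the per-row version agrees everywhere. So, as far as I can tell, the leftover blockIdx.y term is just a per-row base offset, which suggests running one batched transform per row of tiles (NY/KY executions with shifted input pointers) instead of one transform for the whole image. I have not verified this against cuFFT itself.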
As a bonus question, I would also like to have my 2D tiles overlap, similar to this question in 1D: Is it possible to overlap batched FFTs with CUDA's cuFFT library and cufftPlanMany?