Is there an efficient way to use FFTW / CUFFT (they have similar APIs) to perform an fft over a given axis of a multi-dimensional array?
Let's say I have a 3D array of shape (2, 3, 4). The strides are (12, 4, 1), meaning that in order to move one unit along the last axis, we move by 1 unit in the flat array, while to move along the first axis by one unit, we have to step over 3 * 4 = 12 units. (The array is a numpy ndarray which can also have other strides when axes are transposed, but I'd be happy with an answer that addresses just this particular 3D case with the given strides)
Now let's say I want to compute a 1D fft along the middle axis. CUFFT exposes the following function:
cufftResult cufftPlanMany(
cufftHandle *plan, // Plan to be initialized
int rank, // Rank = 1 for 1D fft
int *n, // shape of the fft = 3
int *inembed,
int istride,
int idist,
int *onembed,
int ostride,
int odist,
cufftType type, // e.g. 64 bit float to 128 bit complex
int batch // Could use batch = 2 for the first axis
);
I think we need the nembed
, stride
, dist
parameters to do the transform. They are documented here:
http://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout
The dumentation states that for a 1D fft, the element in batch b at position x will be taken from:
input[b * idist + x * istride]
However, element at position [b][x][z] is stored in:
input[b * 12 + x * 4 + z]
So it's not clear how to make CUFFT loop over the third (z) axis.
If I set:
- idist and odist to 3*4=12 (so that incrementing b moves us along the first axis) and,
- istride and ostride to 4 (so that incrementing x moves along the second axis, which is the axis along which we want to fft),
- batch = 2
- inembed and onembed to 3 (but according to the documentation these are ignored for a 1D transform)
then it computes the correct fft for each of 2 batches where the last axis index is 0, but leaves the sub-arrays for which the last index is 1, 2, or 3 untouched.
This seems like a common use-case, but I can't seem to figure out how to do this with the given parameters without making multiple calls (which is expensive on the GPU) or making a copy with different memory layout.