
I am currently converting a C++ program into CUDA code, and part of my program runs a fast Fourier transform. Originally I used FFTW, but I found that I couldn't call it inside a kernel, so I rewrote that part using cuFFT — but it tells me the same thing!

Is there any FFT library that will run inside a CUDA kernel?

Can I just add __device__ to the FFTW library functions?

I would like to avoid having to initialize or call the FFT from the host. I want a function that runs entirely on the GPU, if one exists.

Bart
Shawn Tabrizi

4 Answers


It looks like you are trying to perform several FFTs at once, if you want to incorporate them into a kernel. I would look into the batch-processing features in cuFFT. What is your application? cufftPlanMany() handles batched FFTs in many different memory layouts.
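For illustration, here is a minimal host-side sketch of a batched 1D complex-to-complex plan with cufftPlanMany(); the sizes, variable names, and tightly packed layout are my own assumptions, not from the question:

```cpp
// Hedged sketch: 64 independent 1024-point C2C FFTs in one cuFFT call.
// All sizes and names here are illustrative assumptions.
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;   // length of each transform
    const int batch = 64; // number of transforms per call

    cufftComplex* d_data;
    cudaMalloc(&d_data, sizeof(cufftComplex) * n * batch);
    // ... fill d_data: transform i occupies d_data[i*n .. i*n + n-1] ...

    cufftHandle plan;
    int dims[1] = { n };
    // inembed/onembed = nullptr means tightly packed data;
    // stride 1 within a transform, distance n between transforms.
    cufftPlanMany(&plan, 1, dims,
                  nullptr, 1, n,
                  nullptr, 1, n,
                  CUFFT_C2C, batch);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // all 64 at once

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

The launch still happens from the host, but one call covers the whole batch, which is usually what "many FFTs inside a kernel" is really after.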

hang

Since this thread still pops up if you search for this today, I just want to add that NVIDIA introduced cuFFTDx (cuFFT Device Extensions) as generally available with CUDA 11.0 (there was also an older early-access version). It is a header-only library that allows inline, in-kernel calls to FFT functionality. I guess this is exactly what you were searching for 10 years ago.

I guess that NVIDIA wants to provide inline kernels for several other math fields as well; hence, the downloaded archive is called mathDx.
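For concreteness, here is a hedged sketch of what an inline, in-kernel FFT looks like with cuFFTDx, based on my reading of its documentation; the size, precision, and SM number are assumptions, and the load/store steps are elided:

```cpp
// Sketch only: a block-level 128-point C2C FFT executed inside a kernel
// with cuFFTDx. SM<800>() targets Ampere; adjust for your GPU.
#include <cufftdx.hpp>
using namespace cufftdx;

using FFT = decltype(Size<128>() + Precision<float>() +
                     Type<fft_type::c2c>() +
                     Direction<fft_direction::forward>() +
                     SM<800>() + Block());

__global__ void block_fft_kernel(typename FFT::value_type* data) {
    // Each thread holds its slice of the FFT in registers.
    typename FFT::value_type thread_data[FFT::storage_size];
    // ... load this block's input from data into thread_data ...

    extern __shared__ char shared_mem[];
    FFT().execute(thread_data,
                  reinterpret_cast<typename FFT::value_type*>(shared_mem));
    // The FFT ran entirely inside the kernel; no host round trip.

    // ... store thread_data back to data ...
}

// Launch with the layout cuFFTDx derived for this FFT description:
// block_fft_kernel<<<1, FFT::block_dim, FFT::shared_memory_size>>>(d_data);
```

The operator-`+` description (Size, Precision, Type, Direction, SM, Block) is how cuFFTDx selects an implementation at compile time; the kernel then calls it like any other device function.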


M. Steiner

Are you sure you need to avoid a launch from the host? NVIDIA's cuFFT library is pretty good these days. Porting FFTW seems like a pretty hard task. You might have an easier time porting kissfft, but it still won't be easy.

Mark Borgerding
  • Really? The simple power-of-two FFT butterfly algorithm is trivial to implement and can be made fairly efficient. The difficulty is making a library that works for general lengths and also runs fast. I would say that rolling your own is not a crazy idea if you just need a very simple implementation. – Henry Gomersall Sep 04 '15 at 10:48

There is NO way to call these APIs from a GPU kernel; you must call them from the host. If you want to run an FFT without a DEVICE -> HOST -> DEVICE round trip before continuing your computation, I think the only solution is to write a kernel that performs the FFT in a device function. I'm actually doing this myself, because I need to run several FFTs in parallel without passing the data back to the HOST. If you find/have another solution, let me know.
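A hedged sketch of that approach, with one small power-of-two FFT per thread so the data never leaves the device; the radix-2 implementation and all names and sizes are my own illustration, not the answerer's actual code:

```cpp
// Sketch: each thread runs its own in-place radix-2 FFT in a __device__
// function, so the spectra stay on the GPU for later processing stages.
#include <cuda_runtime.h>
#include <cuComplex.h>

__device__ void fft_device(cuFloatComplex* a, int n) {  // n must be 2^k
    for (int i = 1, j = 0; i < n; ++i) {                // bit reversal
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { cuFloatComplex t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (int len = 2; len <= n; len <<= 1) {            // butterflies
        float ang = -2.0f * 3.14159265f / (float)len;
        for (int i = 0; i < n; i += len) {
            for (int k = 0; k < len / 2; ++k) {
                float s, c;
                sincosf(ang * k, &s, &c);
                cuFloatComplex w = make_cuFloatComplex(c, s);
                cuFloatComplex u = a[i + k];
                cuFloatComplex v = cuCmulf(a[i + k + len / 2], w);
                a[i + k]           = cuCaddf(u, v);
                a[i + k + len / 2] = cuCsubf(u, v);
            }
        }
    }
}

// One n-point FFT per thread; no DEVICE -> HOST -> DEVICE round trip.
__global__ void many_ffts(cuFloatComplex* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    fft_device(data + (size_t)tid * n, n);
    // ... continue processing this thread's spectrum on the device ...
}
```

This only pays off for many small, independent transforms; for anything large or performance-critical, batched cuFFT (or cuFFTDx, mentioned in another answer) is the better tool.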

Leos313