Using "cuFFT Device Callbacks"

Question

This is my first question, so I'll try to be as detailed as possible. I'm working on implementing noise reduction algorithm in CUDA 6.5. My code is based on this Matlab implementation: http://pastebin.com/HLVq48C1.
I'd love to use new cuFFT Device Callbacks feature, but I'm stuck on cufftXtSetCallback. Every time my cufftResult is CUFFT_NOT_IMPLEMENTED (14). Even example provided by nVidia fails the same way... My device callback testing code:

__device__ void noiseStampCallback(void *dataOut,
                                size_t offset,
                                cufftComplex element,
                                void *callerInfo,
                                void *sharedPointer) {
    element.x = offset;
    element.y = 2;
    ((cufftComplex*)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC noiseStampCallbackPtr = noiseStampCallback;

CUDA part of my code:

cufftHandle forwardFFTPlan;//RtC
//find how many windows there are
int batch = targetFile->getNbrOfNoiseWindows();
size_t worksize;

cufftCreate(&forwardFFTPlan);
cufftMakePlan1d(forwardFFTPlan, WINDOW, CUFFT_R2C, batch, &worksize); //WINDOW = 2048 

//host memory, allocate
float *h_wave;
cufftComplex *h_complex_waveSpec;
unsigned int m_num_real_elems = batch*WINDOW*2;
h_wave = (float*)malloc(m_num_real_elems * sizeof(float));
h_complex_waveSpec = (cufftComplex*)malloc((m_num_real_elems/2+1)*sizeof(cufftComplex));

//init
memset(h_wave, 0, sizeof(float) * m_num_real_elems); //last window won't probably be full of file data, so fill memory with 0
memset(h_complex_waveSpec, 0, sizeof(cufftComplex) * (m_num_real_elems/2+1));
targetFile->getNoiseFile(h_wave); //fill h_wave with samples from sound file

//device memory, allocate, copy from host
float *d_wave;
cufftComplex *d_complex_waveSpec;

cudaMalloc((void**)&d_wave, m_num_real_elems * sizeof(float));
cudaMalloc((void**)&d_complex_waveSpec, (m_num_real_elems/2+1) * sizeof(cufftComplex));

cudaMemcpy(d_wave, h_wave, m_num_real_elems * sizeof(float), cudaMemcpyHostToDevice);

//prepare callback
cufftCallbackStoreC hostNoiseStampCallbackPtr;

cudaMemcpyFromSymbol(&hostNoiseStampCallbackPtr,
                          noiseStampCallbackPtr,
                          sizeof(hostNoiseStampCallbackPtr));

cufftResult status = cufftXtSetCallback(forwardFFTPlan,
                                        (void **)&hostNoiseStampCallbackPtr,
                                        CUFFT_CB_ST_COMPLEX,
                                        NULL);
//always return status 14 - CUFFT_NOT_IMPLEMENTED

//run forward plan
cufftResult result = cufftExecR2C(forwardFFTPlan, d_wave, d_complex_waveSpec);
//result seems to be okay without cufftXtSetCallback

I'm aware that I'm just a beginner in CUDA. My question is:
How can I call cufftXtSetCallback properly or what is a cause of this error?

Robert Crovella · Accepted Answer · 2014-09-13T15:49:53.193

4

Referring to the documentation:

The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems. Use of this API requires a current license. Free evaluation licenses are available for registered developers until 6/30/2015. To learn more please visit the cuFFT developer page.

I think you are getting the not implemented error because either you are not on a Linux 64 bit platform, or you are not explicitly linking against the CUFFT static library. The Makefile in the cufft callback sample will give the correct method to link.

Even if you fix that issue, you will likely run into a CUFFT_LICENSE_ERROR unless you have gotten one of the evaluation licenses.

Note that there are various device limitations as well for linking to the cufft static library. It should be possible to build a statically linked CUFFT application that will run on cc 2.0 and greater devices.

edited Sep 13 '14 at 15:49

answered Sep 13 '14 at 15:37

Robert Crovella

143,785
11
213
257

You're right. The only part of this note I missed was 64 bit LINUX. Well, thanks for your help! – Ghany Sep 13 '14 at 15:55
We understand that these limitations are annoying, and we're working to remove some of them in future releases. They exist for technical reasons, so some engineering is required to work around them. Stay tuned... – Jonathan Cohen Sep 21 '14 at 01:59
I found this webpage more explicitly helpful for setting up statically linked code as well as getting a license. The webpage also says the licenses will be eliminated in the future. http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/ – MrMas Sep 01 '15 at 18:29
I just ran into this. Callbacks are available without license in Cuda 7.0 (am I the last person in the world to learn about this)? https://devtalk.nvidia.com/default/topic/833368/long-term-cufft-callback-licenses/ – MrMas Sep 03 '15 at 21:17
3

@JonathanCohen: Are you aware whether the 64-bit Linux, static linking limitations are planned to change at any point in the near future? 64-bit and static linking aren't that big of a concern, but it would be nice to get the support on OS X/Windows. – Jason R Oct 19 '16 at 21:09

Sebastian · Answer 2 · 2020-09-17T10:44:50.610

A new (2019) possibility are cuFFT device extensions (cuFFTDX). Being part of the Math Library Early Access they are device FFT functions, which can be inlined into user kernels.

Announcement of cuFFTDX:

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9240-cuda-new-features-and-beyond.pdf

Math Library Early Access:

https://developer.nvidia.com/cuda-math-library-early-access-program-page

Example Code:

https://github.com/mnicely/cufft_examples

Using "cuFFT Device Callbacks"

2 Answers2