I wrote a matrix computation C++ library 20 years ago and I want to boost its performance using the Intel MKL library. For complex-valued vectors/matrices, my library uses two split arrays: one for the real part and one for the imaginary part.
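For context, MKL's native DFTI interface can consume exactly this split layout. Below is a minimal sketch for a 1-D complex transform with separate real/imaginary arrays (the configuration values are from the MKL DFTI C API; the helper name split_complex_fft is just for illustration, and error checking is omitted):

#include <mkl_dfti.h>

// Forward 1-D complex FFT, out of place, on split real/imaginary arrays.
void split_complex_fft(int n, double* xre, double* xim, double* yre, double* yim)
{
    DFTI_DESCRIPTOR_HANDLE h = nullptr;
    DftiCreateDescriptor(&h, DFTI_DOUBLE, DFTI_COMPLEX, 1, (MKL_LONG)n);
    DftiSetValue(h, DFTI_COMPLEX_STORAGE, DFTI_REAL_REAL); // split re/im storage
    DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    DftiCommitDescriptor(h);
    DftiComputeForward(h, xre, xim, yre, yim);             // four-array variant
    DftiFreeDescriptor(&h);
}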
Here are the timing results:
- N=65536, fftw time = 0.005(s), fftpack time = 0.001(s)
- N=100000, fftw time = 0.005(s), fftpack time = 0.003(s)
- N=131072, fftw time = 0.006(s), fftpack time = 0.004(s)
- N=250000, fftw time = 0.013(s), fftpack time = 0.007(s)
- N=262144, fftw time = 0.012(s), fftpack time = 0.008(s)
- N=524288, fftw time = 0.022(s), fftpack time = 0.018(s)
- N=750000, fftw time = 0.037(s), fftpack time = 0.025(s)
- N=1048576, fftw time = 0.063(s), fftpack time = 0.059(s)
- N=1500000, fftw time = 0.114(s), fftpack time = 0.079(s)
- N=2097152, fftw time = 0.126(s), fftpack time = 0.146(s)
- N=4194304, fftw time = 0.241(s), fftpack time = 0.35(s)
- N=8388608, fftw time = 0.433(s), fftpack time = 0.788(s)
For vectors of up to about 1.5 million doubles, fftpack is faster than FFTW.
Here is the code I use:
Matrix X = randn(M, 1); // input vector
// start timer
Matrix Y = MyFFTW(X);
// measure time
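The timer pseudo-code above can be made concrete along these lines (a sketch: std::chrono and the single warm-up call are my assumptions, not taken from the original measurements):

#include <chrono>

Matrix X = randn(M, 1);                  // input vector
MyFFTW(X);                               // warm-up call (library/thread initialization)
auto t0 = std::chrono::steady_clock::now();
Matrix Y = MyFFTW(X);
auto t1 = std::chrono::steady_clock::now();
double fftw_seconds = std::chrono::duration<double>(t1 - t0).count();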
//function to compute the FFT
Matrix MyFFTW(Matrix X)
{
    int M = X.rows();
    int N = X.cols();
    Matrix Y(T_COMPLEX, M, N); // complex output to store the FFT results
    // The input could also be a matrix; each column is transformed separately.
    double* in_data = (double*)fftw_malloc(sizeof(double) * M);
    fftw_complex* out_data = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
    fftw_plan fftplan = fftw_plan_dft_r2c_1d(M, in_data, out_data, FFTW_ESTIMATE);
    // one plan is reused for all the matrix columns
    for (int i = 1; i <= N; i++)
    {
        // copy column i into in_data used by the plan; indexing is 1-based, like MATLAB
        memcpy(in_data, X.pr(1, i), M * sizeof(double));
        fftw_execute(fftplan);
        // split out_data into real and imaginary parts
        double* pr = Y.pr(1, i), * pi = Y.pi(1, i);
        int k = (M - 1) / 2, j;
        for (j = 0; j <= k; j++)
        {
            *pr++ = out_data[j][0];
            *pi++ = out_data[j][1];
        }
        if (M % 2 == 0)
        {
            *pr++ = out_data[M / 2][0];
            *pi++ = out_data[M / 2][1];
        }
        for (j = k; j >= 1; j--)
        {
            // conjugate-symmetric upper half of the spectrum: the imaginary part is negated
            *pr++ = out_data[j][0];
            *pi++ = -out_data[j][1];
        }
    }
    fftw_destroy_plan(fftplan);
    fftw_free(in_data);
    fftw_free(out_data);
    return Y;
}
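Since the library already stores complex results as split real/imaginary arrays, one variant worth trying is FFTW's guru split r2c interface, which writes straight into Y's two arrays and removes the out_data unpacking copy. This is only a sketch: I have not verified that the MKL FFTW3 wrappers implement the guru split interface (the reference FFTW library does), and the conjugate-symmetric upper half of each column would still have to be mirrored as in MyFFTW above.

fftw_iodim dim;
dim.n  = M; // transform length
dim.is = 1; // input stride (a column is contiguous)
dim.os = 1; // output stride
// FFTW_ESTIMATE does not overwrite the arrays during planning
fftw_plan plan = fftw_plan_guru_split_dft_r2c(1, &dim, 0, nullptr,
                                              X.pr(1, 1), // real input, column 1
                                              Y.pr(1, 1), // real part of the output
                                              Y.pi(1, 1), // imaginary part of the output
                                              FFTW_ESTIMATE);
for (int i = 1; i <= N; i++)
{
    // new-array execute: the arrays must have the same alignment as the planned ones
    fftw_execute_split_dft_r2c(plan, X.pr(1, i), Y.pr(1, i), Y.pi(1, i));
    // only the first M/2+1 outputs of each column are produced here
}
fftw_destroy_plan(plan);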
Results are obtained on an Intel Core i7 @ 3.2 GHz using Visual Studio 2019 as the compiler and the latest Intel MKL library. Compiler flags are:
/fp:fast /DWIN32 /O2 /Ot /Oi /Oy /arch:AVX2 /openmp /MD
Linker libs are:
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Is there a better way to make FFTW faster for vectors of small size?
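One thing that helps small sizes with the reference FFTW library is to stop re-creating the plan and the scratch buffers on every call and cache them per length instead; as far as I know the MKL wrappers map planning onto DFTI, so the planner-rigor flag matters less there, but the per-call allocation and planning overhead is still avoidable. A minimal sketch (the CachedPlan/get_r2c_plan names are mine, and the cache is not thread-safe as written):

#include <unordered_map>

struct CachedPlan
{
    fftw_plan     plan = nullptr;
    double*       in   = nullptr;
    fftw_complex* out  = nullptr;
};

// Return the plan (plus its scratch buffers) for length M, creating it on first use.
static CachedPlan& get_r2c_plan(int M)
{
    static std::unordered_map<int, CachedPlan> cache;
    CachedPlan& c = cache[M];
    if (!c.plan)
    {
        c.in   = (double*)fftw_malloc(sizeof(double) * M);
        c.out  = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
        c.plan = fftw_plan_dft_r2c_1d(M, c.in, c.out, FFTW_ESTIMATE);
    }
    return c;
}

MyFFTW would then call get_r2c_plan(M) once per invocation, use its in/out buffers in the column loop, and skip the fftw_destroy_plan/fftw_free calls at the end.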
Update:
I tested against MATLAB, which uses the MKL FFTW interface for its fft computation:
- N=65536, matlab fft time = 0.071233(s)
- N=100000, matlab fft time = 0.011437(s)
- N=131072, matlab fft time = 0.0074411(s)
- N=250000, matlab fft time = 0.015349(s)
- N=262144, matlab fft time = 0.0082545(s)
- N=524288, matlab fft time = 0.011395(s)
- N=750000, matlab fft time = 0.022364(s)
- N=1048576, matlab fft time = 0.019683(s)
- N=1500000, matlab fft time = 0.033493(s)
- N=2097152, matlab fft time = 0.035345(s)
- N=4194304, matlab fft time = 0.069539(s)
- N=8388608, matlab fft time = 0.1387(s)
Except for the first call to fft with N=65536, MATLAB (64-bit) is faster than my Win32 function both with fftpack (for N > 500000) and with MKL's FFTW.
Thanks