5

For my work I need to perform discrete Fourier transforms (DFTs) on large images. In the current example I require a 3D FFT of a 1921 x 512 x 512 image (along with 2D FFTs of 512 x 512 images). Right now I am using NumPy and its np.fft.fftn() function. The snippet below times the 2D and 3D FFTs on equal-sized / slightly smaller random grids:

import numpy as np
import time

tas = time.time()
a = np.random.rand(512, 512)
tab = time.time()
b = np.random.rand(100, 512, 512)

tbfa = time.time()

fa = np.fft.fft2(a)
tfafb = time.time()
fb = np.fft.fftn(b)
tfbe = time.time()

print("initializing 512 x 512 grid:", tab - tas)
print("initializing 100 x 512 x 512 grid:", tbfa - tab)
print("2D FFT on 512 x 512 grid:", tfafb - tbfa)
print("3D FFT on 100 x 512 x 512 grid:", tfbe - tfafb)

Output:

initializing 512 x 512 grid: 0.00305700302124
initializing 100 x 512 x 512 grid: 0.301637887955
2D FFT on 512 x 512 grid: 0.0122730731964
3D FFT on 100 x 512 x 512 grid: 3.88418793678

The problem is that I will need this process quite often, so the time spent per image should be short. When testing on my own computer (mid-range laptop, 2 GB RAM allocated to a virtual machine, hence the smaller test grid), the 3D FFT takes on the order of 5 s, as shown above. At work the machines are much better (cluster/grid architecture) and the FFTs run faster; in both cases the 2D transforms finish quasi-instantaneously.
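For context on why the 3D transform costs so much more: an n-dimensional FFT is separable, i.e. np.fft.fftn is equivalent to 1D FFTs applied along each axis in turn, so the 100 x 512 x 512 cube involves roughly two orders of magnitude more work than a single 512 x 512 slice. A small consistency check (sizes shrunk for speed; not part of the original timing code):

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.random((8, 16, 16))

# fftn is separable: 1D FFTs along each axis, applied in sequence
step = np.fft.fft(b, axis=0)
step = np.fft.fft(step, axis=1)
step = np.fft.fft(step, axis=2)

assert np.allclose(step, np.fft.fftn(b))
```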

However, for 1921 x 512 x 512, np.fft.fftn() takes ~5 min. I guess SciPy's implementation is not much faster, and since MATLAB finishes FFTs of same-sized grids within ~5 s, my question is whether there is a way to speed the process up to (or almost to) MATLAB times. My knowledge of FFTs is limited, but apparently MATLAB uses the FFTW library, which NumPy does not. Is there a reasonable chance that with the pyFFTW package I get similar times? Also, 1921 seems an unlucky length, having only two prime factors (17 and 113), so I assume this plays a role too; 512, on the other hand, is a well-suited power of two. Are MATLAB-like times achievable, ideally without zero-padding up to 2048?
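On the padding question: np.fft.fftn accepts an `s` argument that zero-pads (or crops) each axis before transforming, so the padding would not have to be done by hand — though note that the padded transform computes a different spectrum than the unpadded one. A small-scale sketch (shapes shrunk from 1921 x 512 x 512 for illustration):

```python
import numpy as np

a = np.random.rand(10, 16, 16)   # stand-in for the 1921 x 512 x 512 cube

# s pads each axis with trailing zeros up to the requested length (here 10 -> 16)
fa_padded = np.fft.fftn(a, s=(16, 16, 16))
assert fa_padded.shape == (16, 16, 16)

# equivalent to manual zero-padding along axis 0
a_pad = np.zeros((16, 16, 16))
a_pad[:10] = a
assert np.allclose(fa_padded, np.fft.fftn(a_pad))
```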

I'm asking because I'll have to use FFTs a lot (to an extent where such differences have a huge influence!), and if there is no way to reduce computation times in Python, I'd have to switch to another, faster implementation.

  • If pyfftw fails, try comparing against R or octave's fft implementations. If any of those work faster you may call those implementations from python (don't know how big is the penalty) – xvan Oct 15 '16 at 21:05

2 Answers

4

Yes, there is a chance that using FFTW through the pyfftw interface will reduce your computation time compared to numpy.fft or scipy.fftpack. The performance of these DFT implementations can be compared in benchmarks such as this one; some interesting results are reported in Improving FFT performance in Python.

I suggest the following code for a test:

import pyfftw
import numpy
import scipy.fftpack
import time

f = pyfftw.n_byte_align_empty((127, 512, 512), 16, dtype='complex128')
# f = pyfftw.empty_aligned((127, 512, 512), dtype='complex128', n=16)  # newer pyfftw API
f[:] = numpy.random.randn(*f.shape)

# The first call requires more time for plan creation.
# By default, pyfftw uses FFTW_MEASURE for the plan creation, which means that
# many 3D DFTs are computed so as to choose the fastest algorithm.
fftf = pyfftw.interfaces.numpy_fft.fftn(f)

tas = time.time()
fftf = pyfftw.interfaces.numpy_fft.fftn(f)  # here the plan is applied, nothing else
tas = time.time() - tas
print("3D FFT, pyfftw:", tas)

f = pyfftw.n_byte_align_empty((127, 512, 512), 16, dtype='complex128')
f[:] = numpy.random.randn(*f.shape)

tas = time.time()
fftf = numpy.fft.fftn(f)
tas = time.time() - tas
print("3D FFT, numpy:", tas)

tas = time.time()
fftf = scipy.fftpack.fftn(f)
tas = time.time() - tas
print("3D FFT, scipy/fftpack:", tas)

# Padded to 128 (a power of two); again, the first call creates the plan.
f = pyfftw.n_byte_align_empty((128, 512, 512), 16, dtype='complex128')
fftf = pyfftw.interfaces.numpy_fft.fftn(f)

tas = time.time()
fftf = pyfftw.interfaces.numpy_fft.fftn(f)  # here the plan is applied, nothing else
tas = time.time() - tas
print("3D padded FFT, pyfftw:", tas)

For a size of 127*512*512, on my modest computer, I got:

3D FFT, pyfftw: 3.94130897522
3D FFT, numpy: 16.0487070084
3D FFT, scipy/fftpack: 19.001199007
3D padded FFT, pyfftw: 2.55221295357

So pyfftw is significantly faster than numpy.fft and scipy.fftpack. Padding to a power of two is even faster, but what is computed is then different.

Lastly, pyfftw may seem slow on the first run because it uses the FFTW_MEASURE planning flag by default, according to the documentation. That is worthwhile if and only if many DFTs of the same size are computed in succession.

  • Okay, thanks for the answer first of all. As part of my work, I then need to do an azimuthal averaging and for that I multiply two cubes elementwise of dimensions 1921x512x512. At first it took 25 s roughly (too long because I'll have to do this often). I found out that it has to do with strides (which I did not know of until today). The numpy FFT changes it automatically from C to Fortran style. Any way to prevent this (except for a copy)? With same-(C)-styled strides the time goes down to ~ 4 s. – bproxauf Oct 18 '16 at 12:24
  • Giving the axes argument as (2,1,0) instead of (0,1,2) preserves the strides order, but there should be an easier way than that workaround... – bproxauf Oct 18 '16 at 12:36
  • I am not sure I understand what you meant by "The numpy FFT changes it automatically from C to Fortran style". You can use `print(fftf.shape)` to check whether the dimensions are inverted or not: it's not the case. Indeed, if the shape of the input is 127x512x512, the shape of the output is 127x512x512. In addition, I have timed `numpy.multiply(f,fftf)` to perform an element-wise multiplication: it is about 10 times faster than the pyfftw DFT for the size 127x512x512. Hence, I would be surprised if the bottleneck turns out to be an element-wise multiplication! – francis Oct 18 '16 at 19:52
  • I was referring to the strides that were changed: see also my new question: – bproxauf Oct 19 '16 at 06:07
  • http://stackoverflow.com/questions/40109915/python-numpy-fft-changes-strides I received several answers saying that I could use scipy.fftpack instead of numpy.fft (I don't know yet about pyFFTW since I have to wait till Monday until the package is centrally installed (I do not have sudo rights)). Apparently stride structure is preserved there. But I still do not see the reason for numpy.fft.fftn changing the structure in the first place. – bproxauf Oct 19 '16 at 06:09
  • Yes, you are right! `numpy.fft.fftn()` changes the strides and `scipy.fftpack.fftn()` keeps them unchanged. The good thing is that `pyfftw` keeps them unchanged as well. Hence the `numpy.multiply()` is faster because the strides are consistent. I think I will post an answer to your second question... – francis Oct 19 '16 at 18:37
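  • Regarding the stride discussion above: if a transform returns an array whose memory layout no longer matches the second cube, one generic workaround (independent of which FFT backend is used) is to force a C-contiguous copy before the element-wise multiply; `np.ascontiguousarray` is a no-op when the array is already C-ordered:

```python
import numpy as np

b = np.random.rand(8, 16, 16)
fb = np.fft.fftn(b)

# force C order so the element-wise multiply sees matching strides;
# this copies only if fb is not already C-contiguous
fb_c = np.ascontiguousarray(fb)
assert fb_c.flags['C_CONTIGUOUS']
assert np.allclose(fb_c, fb)

prod = np.multiply(fb_c, b)   # element-wise product, as in the comments
```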
0

You could try the FFT from Intel MKL (Math Kernel Library), which is faster than FFTW. Intel provides mkl-fft for Python, which replaces numpy.fft. All you need to do is type:

pip install mkl-fft

and run your program again, without any changes.

Also, numpy 1.17 (soon to be released) will have a new FFT implementation:

Replacement of the fftpack-based FFT module by the pocketfft library

Both implementations have the same ancestor (Fortran77 FFTPACK by Paul N. Swarztrauber), but pocketfft contains additional modifications which improve both accuracy and performance in some circumstances. For FFT lengths containing large prime factors, pocketfft uses Bluestein’s algorithm, which maintains O(N log N) run time complexity instead of deteriorating towards O(N*N) for prime lengths. Also, accuracy for real-valued FFTs with near-prime lengths has improved and is on par with complex-valued FFTs.
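To illustrate the prime-length case that pocketfft handles with Bluestein's algorithm: for a prime length such as 113 (one of the factors of 1921), the FFT must still match the direct DFT definition. A small verification sketch (the naive matrix product is O(N^2) and is used here only as a reference):

```python
import numpy as np

n = 113                     # prime factor of 1921
x = np.random.rand(n)

# direct DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k / n)
j = np.arange(n)
dft_matrix = np.exp(-2j * np.pi * np.outer(j, j) / n)
x_direct = dft_matrix @ x

assert np.allclose(np.fft.fft(x), x_direct)
```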
