
I have a 3D NumPy array with shape (100000, 256, 256), and I'd like to run a 2D FFT on every slice of the stack, i.e. 100,000 FFTs.

I tested the speed on a single slice and on stacked data with the minimal code below.

import numpy as np
a = np.random.random((256, 256))
b = np.random.random((10, 256, 256))

%timeit np.fft.fft2(a)

%timeit np.fft.fftn(b, axes=(1, 2,))

Which gives the following:

872 µs ± 19.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

6.46 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

At this rate, 100,000 FFTs will take more than a minute.

Is there a faster way to run many FFTs or IFFTs at once?

Update: After a bit of searching, I found cupy, which looks like it could help.

Jiadong
  • Are you running into performance problems with your code? More than likely, there are other places that can be optimized. NumPy is already written in C and highly optimized. – jhpratt Mar 06 '19 at 01:46
  • My job simply has to deal with a huge volume of 2D FFTs. Indeed, numpy's fft is optimized; it is faster than many FFT schemes. The only thing I can think of is to parallelize the process. Maybe pycuda is an option, but it takes a lot of effort. – Jiadong Mar 06 '19 at 01:53
  • I mean, there's always the option of spinning up a thread pool? – jhpratt Mar 06 '19 at 01:53
  • I only have experience with pyCUDA threads, not sure what you mean by spinning up a thread pool. Sry. – Jiadong Mar 06 '19 at 02:02
  • You might want to see [this question](https://stackoverflow.com/questions/3033952/threading-pool-similar-to-the-multiprocessing-pool). That's just the first result that comes up, but there's tons of resources on creating a thread pool in Python. – jhpratt Mar 06 '19 at 02:05
  • OK, thanks. Maybe I have to pick up something new to solve this. – Jiadong Mar 06 '19 at 02:14
  • You're computing the FFT along the 10-dimension, not over the other two dimensions. You need to do `np.fft.fftn(b, axes=(1, 2))`! – Cris Luengo Mar 06 '19 at 05:38
  • @jhpratt : indeed, numpy relies on compiled Fortran code for its FFT, more specifically the FFTPACK library. But the FFTW library can be significantly faster than FFTPACK, leaving an opportunity for optimization. And you're right: multithreading can help. – francis Mar 06 '19 at 20:11
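A minimal sketch of the thread-pool idea from the comments (assuming this NumPy build releases the GIL inside its FFT routines, which recent versions do): split the stack into chunks and transform each chunk in a worker thread.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

b = np.random.random((16, 256, 256))
out = np.empty(b.shape, dtype=complex)

def fft_chunk(sl):
    # transform one contiguous chunk of slices and store the result
    out[sl] = np.fft.fftn(b[sl], axes=(1, 2))

# four worker threads, each handling a quarter of the stack
chunks = [slice(i, i + 4) for i in range(0, 16, 4)]
with ThreadPoolExecutor(max_workers=4) as ex:
    list(ex.map(fft_chunk, chunks))

print(np.allclose(out, np.fft.fftn(b, axes=(1, 2))))  # True
```

Whether this actually speeds things up depends on how much of each transform runs outside the GIL, so it is worth benchmarking on the target machine.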

2 Answers


pyfftw, which wraps the FFTW library, is likely faster than the FFTPACK library wrapped by np.fft and scipy.fftpack. After all, FFTW stands for Fastest Fourier Transform in the West.

The minimal code is:

import numpy as np
import pyfftw
import multiprocessing

b = np.random.random((100, 256, 256))

# plan a real-to-complex transform over the last two axes
bb = pyfftw.empty_aligned((100, 256, 256), dtype='float64')
bf = pyfftw.empty_aligned((100, 256, 129), dtype='complex128')
fft_object_b = pyfftw.FFTW(bb, bf, axes=(1, 2), flags=('FFTW_MEASURE',),
                           direction='FFTW_FORWARD',
                           threads=multiprocessing.cpu_count())
bb[:] = b  # copy into the aligned array; bb = b would just rebind the name
fft_object_b()
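A side note on the shapes above: the planned output's last axis has 129 = 256 // 2 + 1 elements because a real-to-complex transform is used, which stores only the non-redundant half of the spectrum. Plain NumPy offers the same saving via rfftn:

```python
import numpy as np

b = np.random.random((10, 256, 256))
# real-input FFT over the last two axes: the last transformed axis is halved
bf = np.fft.rfftn(b, axes=(1, 2))
print(bf.shape)  # (10, 256, 129)
```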

Here is an extended code timing the execution of np.fft and pyfftw:

import numpy as np
from timeit import default_timer as timer
import multiprocessing
a = np.random.random((256, 256))
b = np.random.random((100, 256, 256))

start = timer()
for i in range(10):
    np.fft.fft2(a)
end = timer()
print("np.fft.fft2, 1 slice", (end - start) / 10)

start = timer()
for i in range(10):
    bf = np.fft.fftn(b, axes=(1, 2))
end = timer()
print("np.fft.fftn, 100 slices", (end - start) / 10)
print("bf[3,42,42]", bf[3, 42, 42])


import pyfftw

aa = pyfftw.empty_aligned((256, 256), dtype='float64')
af = pyfftw.empty_aligned((256, 129), dtype='complex128')
bb = pyfftw.empty_aligned((100, 256, 256), dtype='float64')
bf = pyfftw.empty_aligned((100, 256, 129), dtype='complex128')
print('number of threads:', multiprocessing.cpu_count())

fft_object_a = pyfftw.FFTW(aa, af, axes=(0, 1), flags=('FFTW_MEASURE',),
                           direction='FFTW_FORWARD',
                           threads=multiprocessing.cpu_count())

fft_object_b = pyfftw.FFTW(bb, bf, axes=(1, 2), flags=('FFTW_MEASURE',),
                           direction='FFTW_FORWARD',
                           threads=multiprocessing.cpu_count())


aa[:] = a  # copy into the aligned arrays; plain assignment would rebind the names
bb[:] = b
start = timer()
for i in range(10):
    fft_object_a()
end = timer()
print("pyfftw, 1 slice", (end - start) / 10)

start = timer()
for i in range(10):
    fft_object_b()
end = timer()
print("pyfftw, 100 slices", (end - start) / 10)
print("bf[3,42,42]", bf[3, 42, 42])

Finally, the outcome is a significant speed-up: pyfftw proves about 10 times faster than np.fft on my computer, using 2 threads.

np.fft.fft2, 1 slice 0.00459032058716
np.fft.fftn, 100 slices 0.478203487396
bf[3,42,42] (-38.190256258791734+43.03902512127183j)
number of threads: 2
pyfftw, 1 slice 0.000421094894409
pyfftw, 100 slices 0.0439268112183
bf[3,42,42] (-38.19025625879178+43.03902512127183j)

Your computer seems much better than mine!
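For completeness, newer SciPy versions expose FFT multithreading without pyfftw via the workers argument of scipy.fft (available since SciPy 1.4; a sketch, worth benchmarking against pyfftw on your machine):

```python
import numpy as np
import scipy.fft

b = np.random.random((10, 256, 256))
# workers=-1 uses all available cores for the batched transform
bf = scipy.fft.fftn(b, axes=(1, 2), workers=-1)
print(bf.shape)  # (10, 256, 256)
```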

francis

When dealing with FFTs in Python, CuPy has been my go-to package. It has absolutely amazing performance when dealing with huge FFT sizes, plus several iterations over them. It of course relies on making a 1D[2D] plan internally by calling the cuFFT plan functions, but you may not need to worry about that. I already see a 2x improvement when I change your code to something like this:

import cupy as cp
a = cp.random.random((256, 256))
b = cp.random.random((256, 256))

%timeit cp.fft.fft2(a)

%timeit cp.fft.fftn(b, axes=(0, 1,))
Ankit_85