Problem
I need to compute many Fourier transforms and would like to run them in parallel across my many cores. Note that I don't want a parallel FFT algorithm; I just want to launch many embarrassingly parallel FFTs.
I discover that, while my CPU usage goes up, my time to completion does not decrease.
Example
We create some random data
In [1]: import numpy as np
In [2]: x = np.random.random(10000000) # some random data
And time how long it takes to compute an FFT both cold and after having computed it once.
In [3]: %time _ = np.fft.rfft(x) # cost of one run
CPU times: user 589 ms, sys: 23.9 ms, total: 612 ms
Wall time: 613 ms
In [4]: %time _ = np.fft.rfft(x) # there is some speedup from multiple runs
CPU times: user 365 ms, sys: 12.4 ms, total: 378 ms
Wall time: 381 ms
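The warm-run speedup is consistent with NumPy caching some per-size setup work after the first transform of a given length. A minimal sketch (not part of the original session; the helper `time_once` is my own) to separate the one-time cost from the steady-state cost:

```python
# Sketch: compare the first rfft of a given size (which may include
# one-time setup) against subsequent runs of the same size.
import time
import numpy as np

def time_once(func, *args):
    """Wall-clock seconds for a single call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

x = np.random.random(10_000_000)
cold = time_once(np.fft.rfft, x)                            # first run at this size
warm = min(time_once(np.fft.rfft, x) for _ in range(3))     # best of repeated runs
print(f"cold: {cold:.3f}s  warm: {warm:.3f}s")
```

If the cold run is consistently slower, the per-FFT cost used below should be taken from the warm timings.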
We run this on a sequence of data sequentially
In [5]: %time _ = list(map(np.fft.rfft, [x] * 12)) # many runs sequentially (list() forces evaluation on Python 3)
CPU times: user 4.4 s, sys: 135 ms, total: 4.54 s
Wall time: 4.54 s
In [6]: 4.54 / 12 # Same cost per FFT
Out[6]: 0.37833333333333335
We do the same thing but now using a thread pool of four threads.
In [7]: from multiprocessing.pool import ThreadPool
In [8]: pool = ThreadPool(4) # I have four physical cores
In [9]: %time _ = pool.map(np.fft.rfft, [x] * 12)
CPU times: user 15.5 s, sys: 1.3 s, total: 16.8 s
Wall time: 4.79 s
We find no speedup in wall-clock time, yet CPU usage, as measured by top, sits near 400%. So this doesn't look like a GIL problem; something about the FFT workload itself isn't parallelizing well. Perhaps the concurrent runs are thrashing the shared higher-level caches?
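One way to probe this hypothesis (my own addition, not from the original session) is to repeat the experiment with a process pool alongside the thread pool. Process workers have separate memory spaces, so if the slowdown were purely a GIL effect the process pool should scale while the thread pool does not; if both fail to scale, shared hardware resources (memory bandwidth, last-level cache) are the more likely culprit. The helper `timed_map` is assumed, not from the post:

```python
# Sketch: time the same mapped FFT workload over a thread pool and a
# process pool, to separate GIL contention from hardware contention.
import time
import numpy as np
from multiprocessing.pool import ThreadPool, Pool

def timed_map(pool, func, args):
    """Wall-clock seconds for pool.map(func, args)."""
    start = time.perf_counter()
    pool.map(func, args)
    return time.perf_counter() - start

if __name__ == "__main__":
    x = np.random.random(1_000_000)  # smaller than the post's array, to keep pickling cheap
    work = [x] * 12
    with ThreadPool(4) as tp:
        t_threads = timed_map(tp, np.fft.rfft, work)
    with Pool(4) as pp:  # separate processes, separate memory spaces
        t_procs = timed_map(pp, np.fft.rfft, work)
    print(f"threads: {t_threads:.2f}s  processes: {t_procs:.2f}s")
```

Note that the process pool pays extra for pickling the input arrays to the workers, so its absolute times are not directly comparable; the interesting signal is whether either pool scales with the worker count.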
Hardware: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
Question
Generally what's going on here and is there a way to leverage multiple cores to speed up multiple FFTs in parallel?