Problem
I need to compute many Fourier transforms and would like to run them in parallel across my many cores. Note that I don't want a parallel FFT algorithm; I just want to launch many embarrassingly parallel FFTs.
I discover that, while my CPU usage goes up, my time to completion does not decrease.
Example
We create some random data
In [1]: import numpy as np
In [2]: x = np.random.random(10000000) # some random data
And time how long it takes to compute an FFT both cold and after having computed it once.
In [3]: %time _ = np.fft.rfft(x) # cost of one run
CPU times: user 589 ms, sys: 23.9 ms, total: 612 ms
Wall time: 613 ms
In [4]: %time _ = np.fft.rfft(x) # there is some speedup from multiple runs
CPU times: user 365 ms, sys: 12.4 ms, total: 378 ms
Wall time: 381 ms
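The warm-run speedup is consistent with NumPy caching some per-size setup work after the first transform of a given length. A minimal sketch (not part of the original session; the helper `time_once` is my own) to separate the one-time cost from the steady-state cost:

```python
# Sketch: compare the first rfft of a given size (which may include
# one-time setup) against subsequent runs of the same size.
import time
import numpy as np

def time_once(func, *args):
    """Wall-clock seconds for a single call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

x = np.random.random(10_000_000)
cold = time_once(np.fft.rfft, x)                            # first run at this size
warm = min(time_once(np.fft.rfft, x) for _ in range(3))     # best of repeated runs
print(f"cold: {cold:.3f}s  warm: {warm:.3f}s")
```

If the cold run is consistently slower, the per-FFT cost used below should be taken from the warm timings.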
We run this on a sequence of data sequentially
In [5]: %time _ = list(map(np.fft.rfft, [x] * 12)) # many runs sequentially (list() forces evaluation on Python 3)
CPU times: user 4.4 s, sys: 135 ms, total: 4.54 s
Wall time: 4.54 s
In [6]: 4.54 / 12 # Same cost per FFT
Out[6]: 0.37833333333333335
We do the same thing but now using a thread pool of four threads.
In [7]: from multiprocessing.pool import ThreadPool
In [8]: pool = ThreadPool(4) # I have four physical cores
In [9]: %time _ = pool.map(np.fft.rfft, [x] * 12)
CPU times: user 15.5 s, sys: 1.3 s, total: 16.8 s
Wall time: 4.79 s
We find no speedup in wall-clock time, yet CPU usage, as measured by top, sits near 400%. So this doesn't look like a GIL problem; something about the FFT workload itself isn't parallelizing well. Perhaps the concurrent runs are thrashing the shared higher-level caches?
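One way to probe this hypothesis (my own addition, not from the original session) is to repeat the experiment with a process pool alongside the thread pool. Process workers have separate memory spaces, so if the slowdown were purely a GIL effect the process pool should scale while the thread pool does not; if both fail to scale, shared hardware resources (memory bandwidth, last-level cache) are the more likely culprit. The helper `timed_map` is assumed, not from the post:

```python
# Sketch: time the same mapped FFT workload over a thread pool and a
# process pool, to separate GIL contention from hardware contention.
import time
import numpy as np
from multiprocessing.pool import ThreadPool, Pool

def timed_map(pool, func, args):
    """Wall-clock seconds for pool.map(func, args)."""
    start = time.perf_counter()
    pool.map(func, args)
    return time.perf_counter() - start

if __name__ == "__main__":
    x = np.random.random(1_000_000)  # smaller than the post's array, to keep pickling cheap
    work = [x] * 12
    with ThreadPool(4) as tp:
        t_threads = timed_map(tp, np.fft.rfft, work)
    with Pool(4) as pp:  # separate processes, separate memory spaces
        t_procs = timed_map(pp, np.fft.rfft, work)
    print(f"threads: {t_threads:.2f}s  processes: {t_procs:.2f}s")
```

Note that the process pool pays extra for pickling the input arrays to the workers, so its absolute times are not directly comparable; the interesting signal is whether either pool scales with the worker count.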
Hardware: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
Question
Generally what's going on here and is there a way to leverage multiple cores to speed up multiple FFTs in parallel?