
I would like to know how many FLOPS a Fast Fourier Transform (FFT) performs.

So, if I have a 1-dimensional array of N float numbers and I would like to calculate the FFT of this set of numbers, how many FLOPS need to be performed?

I know that this depends on the algorithm used, but what about the fastest available?

I also know that an FFT scales on the order of N*log(N), but that does not answer my question.

asked by thyme
  • Unless you're working with very limited hardware, it's unlikely that you can extrapolate from this data to see how big of an FFT you can perform. – Filip Haglund Oct 14 '16 at 07:02
  • (For any given input and (machine-level) implementation it would perform a number of FloatingPointOperations - FLOs? For one given machine (& settings & boundary conditions (think _thermal throttle_)), it would take some number of allocated or wall-clock seconds, allowing one to figure out FLOPS.) (To chime in with F. Haglund: expect _massive_ memory hierarchy effects.) – greybeard Oct 14 '16 at 08:03
  • FFT execution time is dominated by memory latency, because of the strided memory accesses it performs. On typical processors, the math is the easy part, accessing memory is the problem. – doug65536 Oct 15 '16 at 01:21
  • This question is a duplicate of: http://stackoverflow.com/questions/7957907/how-many-ffts-per-second-can-i-do-on-my-smartphone-for-performing-voice-recogn – stackoverflowuser2010 Oct 15 '16 at 01:22
  • @doug65536 Only for sufficiently large FFT transforms. A 64^3 3D FFT fits in cache, for example. – mabraham Oct 20 '18 at 13:25

4 Answers


That depends on the implementation. Fastest does not necessarily mean the lowest FLOP count, nor the highest FLOPS. The speed is often achieved by exploiting the HW architecture rather than by lowering the FLOP count. There are too many implementations out there, so without actual code and a target architecture your question is unanswerable.

I like precomputed W-matrix implementations, as I usually use the FFT on single-resolution matrices many times, so there is no need to compute W more than once per resolution. That can cut down the FLOP count per recursion layer significantly.

For example this DFFTcc has 14 FLOP per iteration, using only +, -, * operations. Assuming the 1D FFT case with N=8 and using the basic data type, if I did not make any silly mistake:

FLOP = 8*14 + (4+4)*14 + (2+2+2+2)*14 + (1+1+1+1+1+1+1+1)*2 = 14*N*log2(N) + 2*N = 352

If you use real input/output you can lower that even further for the first/last recursion layer. But a simple FLOP count is not enough, as some operations are more complicated than others, and the FLOP count is not the only thing that affects speed.

Now to get the FLOPS, just measure the time [s] the FFT takes:

FLOPS = FLOP/time
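
For illustration, here is a minimal C++ sketch of that measurement. This is an assumption-laden example, not the DFFTcc code above: it is a plain textbook iterative radix-2 FFT, and it uses the nominal 4-mul/6-add butterfly count of 5*N*log2(N) FLOP rather than the 14 FLOP per iteration counted above.

#include <chrono>
#include <cmath>
#include <complex>
#include <cstdio>
#include <utility>
#include <vector>

// In-place iterative radix-2 Cooley-Tukey FFT (N must be a power of two).
void fft(std::vector<std::complex<double>> &a)
{
    const size_t n = a.size();
    // Bit-reversal permutation.
    for (size_t i = 1, j = 0; i < n; i++)
    {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // log2(n) stages, n/2 butterflies per stage.
    for (size_t len = 2; len <= n; len <<= 1)
    {
        const double ang = -2.0 * std::acos(-1.0) / double(len);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (size_t i = 0; i < n; i += len)
        {
            std::complex<double> w(1.0, 0.0);
            for (size_t k = 0; k < len / 2; k++)
            {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = w * a[i + k + len / 2]; // 4 mul + 2 add
                a[i + k]           = u + v;                            // 2 add
                a[i + k + len / 2] = u - v;                            // 2 add
                w *= wlen;                                             // twiddle update (not counted)
            }
        }
    }
}

int main()
{
    const size_t N = 1 << 20; // transform size (assumed large enough to time meaningfully)
    std::vector<std::complex<double>> data(N, {1.0, 0.0});

    const auto t0 = std::chrono::steady_clock::now();
    fft(data);
    const auto t1 = std::chrono::steady_clock::now();
    const double time = std::chrono::duration<double>(t1 - t0).count(); // [s]

    const double flop = 5.0 * double(N) * std::log2(double(N)); // nominal FLOP
    std::printf("check = %g\n", data[0].real());                // keep the result live
    std::printf("time  = %f s\n", time);
    std::printf("FLOPS = %e\n", flop / time);
    return 0;
}

Compile with optimizations (e.g. -O2) and expect the measured FLOPS to land well below the CPU's theoretical peak, for the memory reasons mentioned in the comments above.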
– Spektre
  • I assume you mean floating point operations (FLO, as @greybeard put it?), and not floating point operations per second? – MayeulC Oct 14 '16 at 08:28
  • @MayeulC FLOP = floating point operation, yes, and FLOPS are FLOP/s – Spektre Oct 14 '16 at 08:29
  • @MayeulC you can only measure the **FLOPS**, not compute it. Then you can compare to your HW datasheet parameters to estimate utilization; as greybeard mentioned, there will be a big difference, mostly due to memory, scheduling, sync and other issues. – Spektre Oct 14 '16 at 08:36

As underlined by Spektre, the actual FLOPS (Floating Point OPerations per Second) depend on the particular hardware and implementation, and algorithms with a higher FLOP (Floating Point OPeration) count may correspond to faster implementations, just because such implementations can exploit the hardware more effectively.

If you want to compute the number of floating point operations for a Decimation In Time radix-2 approach, then you can refer to the following figure:

[figure: Decimation In Time radix-2 FFT flow graph]

Let N be the length of the sequence to be transformed. There are log2(N) stages overall, and each stage contains N/2 butterflies. Let us then consider the generic butterfly:

[figure: generic radix-2 butterfly]

Let us rewrite the output of the generic butterfly as

E(i + 1) = E(i) + W * O(i)
O(i + 1) = E(i) - W * O(i)

A butterfly thus involves one complex multiplication and two complex additions. On rewriting the above equations in terms of real and imaginary parts, we have

real(E(i + 1)) = real(E(i)) + (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(E(i + 1)) = imag(E(i)) + (real(W) * imag(O(i)) + imag(W) * real(O(i)))

real(O(i + 1)) = real(O(i)) - (real(W) * real(O(i)) - imag(W) * imag(O(i)))
imag(O(i + 1)) = imag(O(i)) - (real(W) * imag(O(i)) + imag(W) * real(O(i)))

Accordingly, we have

4 multiplications

real(W) * real(O(i)), 
imag(W) * imag(O(i)), 
real(W) * imag(O(i)), 
imag(W) * real(O(i)).

6 additions

real(W) * real(O(i)) - imag(W) * imag(O(i))     (1)
real(W) * imag(O(i)) + imag(W) * real(O(i))     (2)
real(E(i)) + eqn.1
imag(E(i)) + eqn.2
real(E(i)) - eqn.1
imag(E(i)) - eqn.2

Therefore, with N/2 butterflies per stage, log2(N) stages, and 4 real multiplications and 6 real additions per butterfly, the operation counts for the Decimation In Time radix-2 approach are

2N * log2(N) multiplications
3N * log2(N) additions

These operation counts may change if the multiplications are arranged differently; see Complex numbers product using only three multiplications.
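
For example, here is a minimal C++ sketch of that trick (the mul3 name and test values are just illustrative): rearranging the product costs 3 real multiplications and 5 real additions instead of 4 and 2.

#include <complex>
#include <cstdio>

// Complex product with three real multiplications (Karatsuba-style rearrangement).
std::complex<double> mul3(std::complex<double> x, std::complex<double> y)
{
    const double a = x.real(), b = x.imag();
    const double c = y.real(), d = y.imag();
    const double k1 = c * (a + b); // 1 mul, 1 add
    const double k2 = a * (d - c); // 1 mul, 1 add
    const double k3 = b * (c + d); // 1 mul, 1 add
    return { k1 - k3, k1 + k2 };   // 2 adds; total: 3 mul + 5 add (vs. 4 mul + 2 add)
}

int main()
{
    const std::complex<double> x(1.5, -2.0), y(0.25, 3.0);
    const std::complex<double> p = mul3(x, y);
    const std::complex<double> q = x * y; // reference product
    std::printf("mul3: (%g, %g)  std: (%g, %g)\n", p.real(), p.imag(), q.real(), q.imag());
    return 0;
}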

The same results apply to the case of Decimation In Frequency radix-2; see the figure below.

[figure: Decimation In Frequency radix-2 FFT flow graph]

– Vitality

You can estimate FLOPS performance from the FFTW benchmark page. It is slightly outdated, but it contains results for the most efficient FFT implementations.

It seems that a rough estimate is about 5000 MFlops for a 3.0 GHz Intel Xeon Core Duo.
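
If I read the benchmark methodology right, those pages report "MFlops" computed from the nominal complex-transform count 5*N*log2(N) divided by the time per transform in microseconds. Assuming that convention, a quick C++ sketch converts the figure above into a per-transform time estimate:

#include <cmath>
#include <cstdio>

int main()
{
    // benchFFT-style convention (complex transforms): MFlops = 5*N*log2(N) / time[us]
    const double N       = 1024.0;                // example transform size
    const double mflops  = 5000.0;                // rough figure quoted above
    const double flop    = 5.0 * N * std::log2(N); // nominal FLOP per transform
    const double time_us = flop / mflops;          // implied time per transform
    std::printf("N = %g: %g nominal FLOP, ~%.2f us per FFT at %g MFlops\n",
                N, flop, time_us, mflops);
    return 0;
}

So at roughly 5000 MFlops, a 1024-point complex FFT comes out at about 10 us per transform.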

– MBo

The "fastest available" is not only very processor dependent, but likely to use a completely different algorithm my test. But I counted the flops for a bog-simple non-recursive in-place decimation-in-time radix-2 FFT taken right out of an old ACM algorithms textbook for an FFT of length 1024, and got 20480 fmuls and 30720 fadds (this was using a pre-computed twiddle factor table, thus the transcendental function computations were not included in the flop counts). But note that this code additionally used a ton of integer array index calculations, sine table lookups, and data moves that probably took a lot more of the CPUs cycles than did the FPU. Much larger FFTs would likely also incur a large amount of additional data cache misses and other memory latency penalties. It's possible in that situation to speed up the code by adding more FLOPs in exchange for reduced memory hierarchy latency penalties. So, YMMV.

– hotpaw2