
I wanted to compute the Fourier transform of several images. I was therefore benchmarking numpy's `fft.fftn` against a brute-force for loop.

This is the code I used to benchmark the two approaches (in a jupyter notebook):

import numpy as np

x = np.random.rand(32, 256, 256)

def iterate_fft(arr):
    # note: fft2 on float64 input returns complex128, which is
    # downcast into this complex64 buffer
    k = np.empty_like(arr, dtype=np.complex64)
    for i, a in enumerate(arr):
        k[i] = np.fft.fft2(a)
    return k

k_it = iterate_fft(x)
k_np = np.fft.fftn(x, axes=(1, 2))
np.testing.assert_allclose(k_it.real, k_np.real)
np.testing.assert_allclose(k_it.imag, k_np.imag)

%%timeit
k_it = iterate_fft(x)

Output: 63.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
k_np = np.fft.fftn(x, axes=(1, 2))

Output: 122 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
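For reference, the same comparison can be run outside a notebook with the standard-library `timeit` module instead of the `%%timeit` magic (a minimal sketch; the complex128 buffer here avoids the downcast of the notebook version, and timings will of course vary by machine):

```python
import timeit

import numpy as np

x = np.random.rand(32, 256, 256)

def iterate_fft(arr):
    # fft2 on float64 input returns complex128; keep that dtype here
    k = np.empty(arr.shape, dtype=np.complex128)
    for i, a in enumerate(arr):
        k[i] = np.fft.fft2(a)
    return k

# average time per call over 10 runs, for each approach
t_it = timeit.timeit(lambda: iterate_fft(x), number=10) / 10
t_np = timeit.timeit(lambda: np.fft.fftn(x, axes=(1, 2)), number=10) / 10
print(f"iterate_fft: {t_it * 1e3:.1f} ms, fftn: {t_np * 1e3:.1f} ms")
```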

Why is there such a huge difference?

Zaccharie Ramzi

2 Answers


So a person involved in numpy's FFT development has answered the question on GitHub, and it turns out that the slowdown most likely comes from some multi-dimensional array rearrangement used by pocketfft.

This should all be a thing of the past when numpy switches to the scipy 1.4 implementation, which my benchmark shows does not have these drawbacks.
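In the meantime, that implementation is already available in the `scipy.fft` module (a minimal sketch, assuming scipy >= 1.4 is installed; it uses the newer C++ pocketfft and can be timed against numpy the same way):

```python
import numpy as np
import scipy.fft  # requires scipy >= 1.4

x = np.random.rand(32, 256, 256)

# same transform over the image axes, via scipy's pocketfft and numpy's
k_scipy = scipy.fft.fftn(x, axes=(1, 2))
k_numpy = np.fft.fftn(x, axes=(1, 2))

# both should agree to floating-point precision
np.testing.assert_allclose(k_scipy, k_numpy)
```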

Zaccharie Ramzi
1

These routines in numpy currently seem to assume that the last dimension will always be the smallest. When this is actually the case, fftn is faster, sometimes by a lot.

That said, I get a much smaller difference in performance between these two methods than you (with Python 3.7.4, numpy 1.17.2). For your example, iterate_fft takes 46 ms while fftn takes 50 ms. But if I flip the axes around, to (256, 256, 32), I get 55 ms and 40 ms respectively. Pushing even further with a shape of (256, 256, 2) I get 21 ms and 4 ms respectively.
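The axis-flipping experiment above can be reproduced directly (a sketch; the absolute timings are machine-dependent, only the relative ordering matters):

```python
import timeit

import numpy as np

# same 32 FFTs of 256x256 images, with the batch axis first vs last
x_first = np.random.rand(32, 256, 256)   # batch axis first (the question's layout)
x_last = np.random.rand(256, 256, 32)    # batch axis last

t_first = timeit.timeit(lambda: np.fft.fftn(x_first, axes=(1, 2)), number=10)
t_last = timeit.timeit(lambda: np.fft.fftn(x_last, axes=(0, 1)), number=10)
print(f"batch first: {t_first:.3f} s, batch last: {t_last:.3f} s")
```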

Note that if performance is really an issue, there are other FFT libraries available that perform better in some situations. Also the full fftpack in scipy can have very different performance than the more limited code in numpy.
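For instance, scipy's fftpack exposes its own `fftn` with the same axes semantics, so it is straightforward to benchmark as a drop-in alternative (a sketch, assuming scipy is installed):

```python
import numpy as np
from scipy import fftpack

x = np.random.rand(32, 256, 256)

# scipy.fftpack.fftn accepts the same axes argument as numpy's fftn,
# so the results should match to floating-point precision
k_sp = fftpack.fftn(x, axes=(1, 2))
np.testing.assert_allclose(k_sp, np.fft.fftn(x, axes=(1, 2)))
```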

Note that your usage of fftn basically does:

x = np.random.rand(32, 256, 256)

a = np.fft.fft(x, n=256, axis=2)
a = np.fft.fft(a, n=256, axis=1)

np.testing.assert_allclose(np.fft.fftn(x, axes=(1, 2)), a)
Sam Mason
  • Thanks for your answer. My `numpy` version is 1.16.4 (not a strong requirement, I will change it to see what happens), and my Python version is 3.6.8 (I would like to keep it that way). I am not entirely sure I understand where your first point comes from when reading the doc, can you give more details? Also, I am surprised by those results you get for a batch size of 2. What happens for a batch size of 1? It would seem that `fftn` is faster at performing FFT2 than `fft2`. Thanks for suggesting other libraries, performance is not really an issue for me, I was just surprised. – Zaccharie Ramzi Sep 20 '19 at 08:46
  • numpy 1.17 seems to have introduced use of [pocketfft](https://github.com/numpy/numpy/tree/v1.17.0rc1/numpy/fft), which changes performance a bit. note that my shape changes were just demonstrating the case the current code seems to be optimised for, it's doing something different from what you want to do. iterating over images seems to be better for your use case given current optimisations – Sam Mason Sep 20 '19 at 08:51
  • Yes I just tried with numpy 1.17 and it's also better when iterating with batch dimension first. I will try to profile with batch dimension last, especially the batch size 1 case. Yes totally understand what you are doing, I just don't understand what are those optimisations in particular that make it work so much better in the batch size last case, with a small batch size. – Zaccharie Ramzi Sep 20 '19 at 08:56
  • So when the batch dimension is the last one, I also have `fftn` beating the iterative version by a small margin. With a batch size of 2 however, the results are the same no matter where the batch dimension is (last or first) and are the following: iterative <-> 2.9ms, `fftn` <-> 2.5ms. So the margin is not nearly as big as yours. I will upload a gist to show that. When batch size is 1, we pay the cost of the for loop I guess, because it's 2.2ms vs 1.2ms. Also, I noticed this in the docs: > fft2 is just fftn with a different default for axes. – Zaccharie Ramzi Sep 20 '19 at 09:18
  • https://gist.github.com/zaccharieramzi/f05930697b40b74b412e14b420316972 here is the code I used (you can see the results for batch size 1, but it's easily adaptable to 2 or 32). Anyway I think this probably shows some inefficiency in `fftn`, right? – Zaccharie Ramzi Sep 20 '19 at 09:20
  • that's what I mean by "doing something different from what you want to do". I'm leaving `axes=(1,2)` and just changing the input shape. I was just trying to understand where the performance difference was coming from, not trying to get/do the same thing. for your case, just iterate over the images. each fft takes quite a bit of work, so being "vectorised" doesn't matter – Sam Mason Sep 20 '19 at 09:31
  • Oh ok! yes that wasn't clear to me. However I did notice some difference when changing the order of the dimensions (and the axes accordingly). I still don't understand the first point you make though: where do you see that (in the code not empirically) and why would there be such an assumption? – Zaccharie Ramzi Sep 20 '19 at 09:45