Python numpy.fft changes strides

Question

Dear stackoverflow community!

Today I found that on a high-end cluster architecture, an elementwise multiplication of 2 cubes with dimensions 1921 x 512 x 512 takes ~ 27 s. This is much too long since I have to perform such computations at least 256 times for an azimuthal averaging of a power spectrum in the current implementation. I found that the slow performance was mainly due to different stride structures (C in one case and FORTRAN in the other). One of the two arrays was a newly generated boolean grid (C order) and the other one (FORTRAN order) came from the 3D numpy.fft.fftn() Fourier transform of an input grid (C order). Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)? With similar strides (ndarray.copy() of the FT grid), ~ 4s are achievable, a tremendous improvement.

The question is therefore as follows:

Consider the array:

ran = np.random.rand(1921, 512, 512)
ran.strides
(2097152, 4096, 8)

a = np.fft.fftn(ran)
a.strides
(16, 30736, 15736832)

As we can see the stride structure is different. How can this be prevented (without using a = np.fft.fftn(ran, axes = (1,0)))? Are there any other numpy array routines that could affect stride structure? What can be done in those cases?

Helpful advice is as usual much appreciated!

It could be that `fft` uses fast FORTRAN code, and expects the corresponding order. What happens if the input is also order F? — hpaulj, Oct 18 '16 at 14:48

score 2 · Answer 1 · answered Oct 18 '16 at 19:16


You could use scipy.fftpack.fftn (as suggested by hpaulj too) rather than numpy.fft.fftn; it looks like it does what you want. It is, however, slightly slower:

import numpy as np
import scipy.fftpack

ran = np.random.rand(192, 51, 51)  # not much memory on my laptop
a = np.fft.fftn(ran)
b = scipy.fftpack.fftn(ran)

ran.strides
(20808, 408, 8)
a.strides
(16, 3072, 156672)
b.strides
(41616, 816, 16)

timeit -n 100 np.fft.fftn(ran)
100 loops, best of 3: 37.3 ms per loop
timeit -n 100 scipy.fftpack.fftn(ran)
100 loops, best of 3: 41.3 ms per loop

answered Oct 18 '16 at 19:16

rikyborg


Okay, any reason why np.fft.fftn does that? I haven't got the background and can't find anything useful in the documentation... – bproxauf Oct 19 '16 at 06:10

I also haven't found anything in the documentation. However, if you look at the [source code for scipy.fftpack.fftn](https://github.com/scipy/scipy/blob/v0.18.1/scipy/fftpack/basic.py), function _raw_fftnd, you can see that the axes are swapped before the computation and then swapped back after it. So I think the difference is that scipy.fftpack takes care of this automatically and numpy.fft doesn't, hence the performance difference. – rikyborg Oct 19 '16 at 09:44

score 1 · Accepted Answer · edited May 23 '17 at 12:00

Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)?

Computing the multidimensional DFT of an array consists of successively computing 1D DFTs over each dimension. There are two strategies:

1. Restrict 1D DFT computations to contiguous 1D arrays. As the array is contiguous, problems related to latency and cache misses are reduced. This strategy has a major drawback: the array has to be transposed between dimensions. It is likely the strategy adopted by numpy.fft. At the end of the computation, the array has been transposed. To avoid unnecessary computations, the transposed array is returned and the strides are modified.
2. Enable 1D DFT computations for strided arrays. This might trigger some problems related to latency. It is the strategy of FFTW, available through the interface pyfftw. As a result, the output array features the same strides as the input array.
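
To make the "successive 1D DFTs" statement concrete, here is a minimal sketch (the array shape and variable names are made up for illustration). It only checks the mathematical equivalence; it does not reproduce the internal transposition logic of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((6, 5, 4))          # small illustrative input

# Apply a 1D DFT along each axis in turn.
y = x.astype(complex)
for axis in range(x.ndim):
    y = np.fft.fft(y, axis=axis)

# The result agrees with the n-dimensional transform.
matches = np.allclose(y, np.fft.fftn(x))
print(matches)
```

Both strategies above compute exactly this; they differ only in how each 1D pass accesses memory.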

Timing numpy.fft.fftn against pyfftw.interfaces.numpy_fft.fftn, as performed here and there or there, will tell you whether FFTW is really the Fastest Fourier Transform in the West or not...

To check that numpy uses the first strategy, take a look at numpy/fft/fftpack.py. At lines 81-85, the call to work_function(a, wsave) (i.e. fftpack.cfftf, from FFTPACK, arguments documented there) is enclosed between calls to numpy.swapaxes() performing the transpositions.
scipy.fftpack.fftn does not seem to change the strides... Nevertheless, it seems that it makes use of the first strategy. scipy.fftpack.fftn() calls scipy.fftpack.zfftnd() which calls zfft(), based on zfftf1 which does not seem to handle strided DFTs. Moreover, zfftnd() calls many times the function flatten() which performs the transposition.
Another example: for parallel distributed-memory multidimensional DFTs, FFTW-MPI uses the first strategy to avoid any MPI communication between processes during the 1D DFTs. Of course, functions to transpose the array are not far away, and a lot of MPI communication is involved in the process.

Are there any other numpy array routines that could affect stride structure? What can be done in those cases?

You can search the GitHub repository of numpy for swapaxes: this function is only used a couple of times. Hence, to my mind, this "change of strides" is particular to fft.fftn() and most numpy functions keep the strides unchanged.
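
As a small illustration of why swapaxes matters here (a sketch on a made-up array, not taken from the numpy source): it returns a view whose strides are permuted, so no data is moved and the memory layout seen by later operations changes.

```python
import numpy as np

x = np.zeros((4, 3, 2))            # float64, C order
v = np.swapaxes(x, 0, 2)           # view with axes 0 and 2 exchanged

print(x.strides)                   # (48, 16, 8)
print(v.strides)                   # (8, 16, 48)
print(np.shares_memory(x, v))      # True: same buffer, nothing copied
print(v.flags['C_CONTIGUOUS'], v.flags['F_CONTIGUOUS'])  # False True
```

Checking arr.strides or arr.flags after a call is a quick way to see whether a given routine has changed the layout.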

Finally, the "change of strides" is a feature of the first strategy and there is no way to prevent that. The only workaround is to swap the axes at the end of the computation, which is costly. But you can rely on pyfftw since fftw implements the second strategy in a very efficient way. The DFT computations will be faster, and subsequent computations will be faster as well if the strides of the different arrays become consistent.
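
If switching libraries is not an option, one way to get consistent strides is to pay for a single copy up front, as the question already observed with ndarray.copy(). The sketch below is an assumption-laden illustration: the Fortran-ordered array is a stand-in built with np.asfortranarray rather than an actual fftn result, and the shape is tiny.

```python
import numpy as np

shape = (8, 6, 4)
mask = np.ones(shape, dtype=bool)                        # C order, like a fresh boolean grid
ft = np.asfortranarray(np.ones(shape, dtype=complex))    # stand-in for an F-ordered transform

# One explicit copy into C order; later elementwise operations
# then walk both operands in the same memory order.
ft_c = np.ascontiguousarray(ft)
prod = mask * ft_c

print(ft.strides)      # (16, 128, 768)
print(ft_c.strides)    # (384, 64, 16)
```

np.ascontiguousarray returns its input unchanged when it is already C-contiguous, so the call is safe to leave in place regardless of what layout the transform returns.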
