Dear stackoverflow community!
Today I found that on a high-end cluster architecture, an elementwise multiplication of 2 cubes with dimensions 1921 x 512 x 512 takes ~ 27 s. This is much too long since I have to perform such computations at least 256 times for an azimuthal averaging of a power spectrum in the current implementation. I found that the slow performance was mainly due to different stride structures (C in one case and FORTRAN in the other). One of the two arrays was a newly generated boolean grid (C order) and the other one (FORTRAN order) came from the 3D numpy.fft.fftn() Fourier transform of an input grid (C order). Any reasons why numpy.fft.fftn() changes the strides and ideas on how to prevent that except for reversing the axes (which would be just a workaround)? With similar strides (ndarray.copy() of the FT grid), ~ 4s are achievable, a tremendous improvement.
The question is therefore as following:
Consider the array:
ran = np.random.rand(1921, 512, 512)
ran.strides
(2097152, 4096, 8)
a = np.fft.fftn(ran)
a.strides
(16, 30736, 15736832)
As we can see the stride structure is different. How can this be prevented (without using a = np.fft.fftn(ran, axes = (1,0)))? Are there any other numpy array routines that could affect stride structure? What can be done in those cases?
Helpful advice is as usual much appreciated!