I have a working algorithm to analyse experimental datasets. It consists of two main functions. The first one takes a large array as one of its inputs and returns an intermediate 3D, complex-valued array that often does not fit in memory, so I use NumPy's memmap to save this array to disk. With larger datasets, the second function starts to take much longer, and the slowdown appears to be linked to memory access. In some instances, a computation that takes 20 minutes grows to almost 24 hours when n2 is increased by 50%.
After stripping about 99% of the whole program, a minimal working example looks like this:
import numpy as np

n1, n2 = 10, 100
Nb = 12
Nf = n2 // Nb
X = np.random.rand(n1, n2)

# Function 1: Compute X2 from X and save it on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n, :, :] = np.fft.fft(Xn, axis=1)
X2.flush()
del X2

# Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:, :, k]  # <- Problematic step
    # do things with Y...
The values of n1, n2, Nb and Nf are typically much larger.
As you can see, in function 1, X2 is populated using its first index, which, according to this post, is the most efficient way to do it in terms of speed. My problem occurs in function 2, where I need to work on X2 by slicing it along its third dimension. This is required by the nature of the algorithm.
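To see why that slice is expensive, here is a minimal sketch of the memory layout, using an in-memory array as a stand-in (a C-ordered memmap has the same strides; the shapes are the ones from the example above):

import numpy as np

# Same shapes as the example above (Nf = 100 // 12 = 8).
Nb, n1, Nf = 12, 10, 8

# In-memory stand-in for X2; a C-ordered memmap is laid out identically.
X2 = np.zeros((Nb, n1, Nf), dtype=np.complex128)

# Only the last axis is contiguous (16 bytes per complex128). Fixing k
# and varying the first two axes jumps Nf*16 bytes between consecutive
# elements, so on the memmap each X2[:, :, k] gathers Nb*n1 scattered
# values spread across nearly the whole file.
print(X2.strides)  # (1280, 128, 16) = (n1*Nf*16, Nf*16, 16)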
I would like to find a way to refactor these functions to reduce the execution time. I have already improved some parts of the algorithm. For example, since I know X contains only real values, I can halve the size of X2 by discarding its conjugate values. However, the slicing of X2 remains problematic. (Actually, I just learned about np.fft.rfft while writing this, which will definitely help me.)
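For a real-valued input, np.fft.rfft returns only the non-negative frequencies, so the last axis shrinks from Nf to Nf // 2 + 1. A quick sketch of the shape change:

import numpy as np

n1, Nf = 10, 8
Xn = np.random.rand(n1, Nf)

full = np.fft.fft(Xn, axis=1)   # shape (n1, Nf), complex
half = np.fft.rfft(Xn, axis=1)  # shape (n1, Nf//2 + 1), complex

# For real input, rfft matches the first half of the full spectrum;
# the remaining columns of fft() are its redundant conjugates.
print(half.shape)                              # (10, 5)
print(np.allclose(full[:, :Nf//2 + 1], half))  # True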
Do you see a way I could refactor function 2 (and/or function 1) so I can access X2 more efficiently?
Update
I tested yut23's solution on one of my largest datasets, and it turns out that the first optimization, moving Nf to axis 1, is slightly faster overall than moving it to axis 0.
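As a rough sketch of what moving Nf to axis 1 looks like (this is my reading of the optimization, and the file name X2_opt1.dat is made up for illustration): X2 is stored with shape (Nb, Nf, n1) instead of (Nb, n1, Nf), so each slice on k reads Nb contiguous runs of n1 values rather than Nb*n1 scattered ones.

import numpy as np

n1, n2, Nb = 10, 100, 12
Nf = n2 // Nb
X = np.random.rand(n1, n2)

# Store the sliced axis as axis 1: shape (Nb, Nf, n1) instead of
# (Nb, n1, Nf). The file name is hypothetical.
X2 = np.memmap("X2_opt1.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n, :, :] = np.fft.fft(Xn, axis=1).T  # transpose so frequency is axis 1
X2.flush()
del X2

# Function 2 now slices on axis 1: each X2[:, k, :] is Nb contiguous
# runs of n1 complex values.
X2 = np.memmap("X2_opt1.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='r')
for k in range(Nf):
    Y = np.pi * X2[:, k, :]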
Below are three charts showing the profiling results for the total execution time, the time spent in function 1 (excluding X2.flush()), and the time spent in function 2, respectively. On the x axis, Nr is a value proportional to n2 and Nb. I tested my initial code with both optimizations, and the modified code using NumPy's rfft(), also with both optimizations.
With my initial code, opt. 1 is the better option, with a total time reduction of more than one order of magnitude for Nr=12. Using rfft() gives almost another order of magnitude in time reduction, but in this case both optimizations are equivalent (everything fits in the available RAM, so the time reduction from swapping the array axes is minimal). However, this will make it possible to work on larger datasets more efficiently!
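For completeness, here is a sketch of the rfft() variant combined with opt. 1 (as above, the layout and the file name X2_rfft.dat are my assumptions, not the exact code I benchmarked):

import numpy as np

n1, n2, Nb = 10, 100, 12
Nf = n2 // Nb
Nh = Nf // 2 + 1  # rfft keeps only the non-negative frequencies
X = np.random.rand(n1, n2)

# Hypothetical file name; layout follows opt. 1 (frequency on axis 1).
X2 = np.memmap("X2_rfft.dat", shape=(Nb, Nh, n1), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n, :, :] = np.fft.rfft(Xn, axis=1).T  # half the columns of fft()
X2.flush()
del X2

X2 = np.memmap("X2_rfft.dat", shape=(Nb, Nh, n1), dtype=np.complex128, mode='r')
for k in range(Nh):
    Y = np.pi * X2[:, k, :]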