
I have a working algorithm to analyse experimental datasets. This algorithm consists of two main functions. The first one takes a large array as one of its inputs and returns an intermediate 3D, complex-valued array that often does not fit in memory. For this reason, I use Numpy's memmap to save this array to disk. With larger datasets, the second function starts to take a lot more time, and this appears to be linked to how the memory is accessed. In some instances, a computation that would normally last 20 minutes takes almost 24 hours after increasing n2 by 50%.

After stripping about 99% of the whole program, a minimal working example looks like this:

import numpy as np

n1, n2 = 10, 100

Nb = 12
Nf = int(n2 / Nb)

X = np.random.rand(n1, n2)

#   Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w+')

for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1)

X2.flush()
del X2

#   Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')

for k in range(Nf):
    Y = np.pi * X2[:,:,k]  #   <- Problematic step
    # do things with Y...

The values of n1, n2, Nb and Nf are typically much larger.

As you can see, in function 1, X2 is populated using its first index, which according to this post is the most efficient way to do it in terms of speed. My problem occurs in function 2, where I need to work on X2 by slicing it along its third dimension. This is required by the nature of the algorithm.
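
To make the access-pattern issue concrete, here is a rough standalone timing sketch (demo.dat and the sizes are arbitrary, and the exact numbers depend on the disk and the OS page cache):

import timeit
import numpy as np

# Arbitrary sizes, larger than the MWE so the access pattern actually matters
Nb_, n1_, Nf_ = 64, 512, 256

A = np.memmap("demo.dat", shape=(Nb_, n1_, Nf_), dtype=np.complex128, mode='w+')
A[:] = 0
A.flush()

# Slicing on the first axis reads one contiguous block of n1_*Nf_ values;
# slicing on the last axis reads Nb_*n1_ values scattered through the whole file.
t_first = timeit.timeit(lambda: np.array(A[0, :, :]), number=50)
t_last = timeit.timeit(lambda: np.array(A[:, :, 0]), number=50)
print(f"slice on axis 0: {t_first:.3f} s, slice on axis 2: {t_last:.3f} s")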

I would like to find a way to refactor these functions to reduce the execution time. I have already improved some parts of the algorithm. For example, since I know X contains only real values, I can cut the size of X2 in half by neglecting its conjugate values. However, the slicing of X2 remains problematic. (Actually, I just learned about np.fft.rfft while writing this, which will definitely help me.)
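
As a quick sanity check of the conjugate-symmetry idea, np.fft.rfft() of a real signal returns exactly the non-redundant half of np.fft.fft():

import numpy as np

x = np.random.rand(8)   # real-valued signal
F = np.fft.fft(x)       # full spectrum, 8 complex values
R = np.fft.rfft(x)      # non-negative frequencies only, 8//2 + 1 = 5 values

# For real input, the upper half of the spectrum is the reversed conjugate of
# the lower half, so rfft() keeps all the information in about half the storage.
assert np.allclose(R, F[:5])
assert np.allclose(F[5:], np.conj(F[1:4])[::-1])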

Do you see a way I could refactor function 2 (and/or function 1) so I can access X2 more efficiently?

Update

I tested yut23's solution on one of my largest datasets, and it turns out that the first optimization, moving Nf to axis 1, is slightly faster overall than moving it to axis 0.

Below are three charts showing the profiling results for the total execution time, time spent in function 1 (excluding X2.flush()) and time spent in function 2, respectively. On the x axis, Nr is a value proportional to n2 and Nb. I tested my initial code with both optimizations and the modified code using Numpy's rfft(), also with both optimizations.

With my initial code, opt. 1 is the better option, with a total time reduction of more than one order of magnitude for Nr=12. Using rfft() gives almost another order of magnitude in time reduction, but in this case both optimizations are equivalent (everything fits in the available RAM, so the time reduction from swapping the array axes is minimal). However, this will make it possible to work on larger datasets more efficiently!

[Charts: total execution time, execution time of function 1, execution time of function 2]
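
For reference, a sketch of what the modified code looks like with rfft() combined with opt. 1 (Nh = Nf // 2 + 1 is just a name I use here for the rfft output length); the actual code is of course more involved:

import numpy as np

n1, n2 = 10, 100
Nb = 12
Nf = int(n2 / Nb)
Nh = Nf // 2 + 1

X = np.random.rand(n1, n2)

#   Function 1: rfft() stores only half the spectrum; the transpose moves Nh before n1
X2 = np.memmap("X2.dat", shape=(Nb, Nh, n1), dtype=np.complex128, mode='w+')
for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.rfft(Xn, axis=1).T
X2.flush()
del X2

#   Function 2: slicing on axis 1 now reads contiguous runs of n1 values
X2 = np.memmap("X2.dat", shape=(Nb, Nh, n1), dtype=np.complex128, mode='r')
for k in range(Nh):
    Y = np.pi * X2[:,k,:]
    # do things with Y...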

JD80121
  • How big are n1, n2, Nb and Nf approximately in practice? How much memory space do you roughly have? – Jérôme Richard May 14 '22 at 00:59
  • @JérômeRichard I work on datasets with `n1≈44000`, `n2≈10000-34000`, `Nb≈110-330` and `Nf=300`. The PC I have access to has 64GB of RAM with a quite slow HDD. I am testing @yut23's optimizations right now and will update my question with the results. – JD80121 May 15 '22 at 17:45
  • Ok, thank you. For these values, the solution of @yut23 should be much faster, certainly two orders of magnitude faster. Your current code was reading the whole file 300 times in practice. There are ways to make it a bit faster, but at the expense of much more complex code, assuming your HDD bandwidth is not saturated. You can use iotop on Linux to check that. As for measuring the maximum bandwidth, I think a basic Python benchmark should do the job. – Jérôme Richard May 15 '22 at 20:46
  • With `rfft`, it takes half the space in memory. The OS can mostly operate in memory (since you should use up to 32 GiB of RAM in this case, out of 64) and use asynchronous write-back through an internal cache. Since FFTs are expensive, it can certainly overlap the writing with the computation. Reads from the storage device are not needed thanks to the cache. It should still be a bit slower than just doing everything in RAM, but this makes the code more flexible. You can check that by allocating more RAM with a 30 GiB uint8 array (this should be slower since the OS can no longer use that memory for the cache). – Jérôme Richard May 24 '22 at 13:04
  • Note that if you execute the operation many times, the cache may not be large enough and you will certainly pay the big overhead of the writes (required synchronization), unless the FFT is slower than that (which would be surprising if you use a slow HDD). – Jérôme Richard May 24 '22 at 13:06

1 Answer


An easy optimization is to swap the last two axes, which shouldn't change the speed of function 1 (assuming the out-of-order memory accesses from the transpose are negligible compared to disk access) but should make function 2 run faster, for the same reasons discussed in the question you linked:

#   Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='w+')

for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[n,:,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower

X2.flush()
del X2

#   Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nb, Nf, n1), dtype=np.complex128, mode='r')

for k in range(Nf):
    Y = np.pi * X2[:,k,:]
    # do things with Y...
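
To see the difference in access pattern, you can compare the strides of the slice each version reads (shown here with plain in-memory arrays at the MWE sizes; the memmap lays out the data the same way on disk):

import numpy as np

n1, Nb, Nf = 10, 12, 8  # MWE sizes; complex128 itemsize is 16 bytes

a = np.zeros((Nb, n1, Nf), dtype=np.complex128)  # original layout
b = np.zeros((Nb, Nf, n1), dtype=np.complex128)  # last two axes swapped

# In the original layout, consecutive elements of a[:, :, k] sit Nf*16 = 128 bytes
# apart; after the swap, b[:, k, :] is read in contiguous runs of n1 values.
print(a[:, :, 0].strides)  # (1280, 128)
print(b[:, 0, :].strides)  # (1280, 16)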

You can make function 2 even faster at the cost of a slowdown in function 1 by moving Nf to axis 0:

#   Function 1: Compute X2 from X1 and save on disk
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='w+')

for n in range(Nb):
    Xn = X[:, n*Nf:(n+1)*Nf]
    X2[:,n,:] = np.fft.fft(Xn, axis=1).T  # swap axes so Nf changes slower

X2.flush()
del X2

#   Function 2: Do something with slices of X2
X2 = np.memmap("X2.dat", shape=(Nf, Nb, n1), dtype=np.complex128, mode='r')

for k in range(Nf):
    Y = np.pi * X2[k,:,:]
    # do things with Y...

It might make sense to use this version if X2 is read more times than it's written. Also, the slowdown of function 1 should get smaller as n1 gets bigger, since the contiguous chunks are larger.

With the data files stored on a hard drive and n1, n2, Nb = 1000, 10000, 120, my timings are

function 1, original:          1.53 s ± 41.9 ms
function 1, 1st optimization:  1.53 s ± 27.8 ms
function 1, 2nd optimization:  1.57 s ± 34.9 ms

function 2, original:          111 ms ± 1.2 ms
function 2, 1st optimization:  45.5 ms ± 197 µs
function 2, 2nd optimization:  27.8 ms ± 29.7 µs
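
A rough sketch of one way to time the two functions separately with timeit (the optimized versions only change the memmap shape and the indices):

import timeit
import numpy as np

n1, n2, Nb = 1000, 10000, 120
Nf = int(n2 / Nb)
X = np.random.rand(n1, n2)

def function_1():
    X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='w+')
    for n in range(Nb):
        X2[n,:,:] = np.fft.fft(X[:, n*Nf:(n+1)*Nf], axis=1)
    X2.flush()
    del X2

def function_2():
    X2 = np.memmap("X2.dat", shape=(Nb, n1, Nf), dtype=np.complex128, mode='r')
    for k in range(Nf):
        Y = np.pi * X2[:,:,k]

function_1()  # make sure X2.dat exists before timing function 2
print("function 1:", min(timeit.repeat(function_1, number=1, repeat=5)), "s")
print("function 2:", min(timeit.repeat(function_2, number=1, repeat=5)), "s")
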
yut23
  • The data files are certainly not stored/loaded from your hard drive but from a cache in memory. The OS uses caching to improve performance, mainly on HDDs. A good HDD generally has a bandwidth of about 100 MiB/s. Here, the throughput is `16*Nb*n1*Nf/1024**2/0.0278 = 5467 MiB/s`. You should try with much bigger data for the benchmark to be representative of the OP's use case. – Jérôme Richard May 14 '22 at 01:07