
I have 3D volumes of data (x, y, z) and I want to extract 2D slices (xy, yz, xz planes) and save them for future use.

The way I tried to do this is with one function (slice_data) that takes the slices and another (save_slices) that calls slice_data and then uses numpy.save to save the slices.

If I am not saving the slices, the time to take them is similar regardless of whether I extract the xy, yz, or xz plane. However, if I save the slices, the time depends on the direction of the slice and differs between the xy, yz, and xz planes.

Why is that? On my full dataset, this difference grows from minutes to hours...

I have written the dummy code below, similar to what I use on my full dataset, to demonstrate the problem. The timings are:

Not saving, just slicing:

mean time x-slice: 0.00011536836624145507 sec
mean time y-slice: 0.00011417627334594726 sec
mean time z-slice: 0.00011371374130249023 sec

Slicing and saving:

mean time x-slice: 0.04629791975021362 sec
mean time y-slice: 0.06096100091934204 sec
mean time z-slice: 0.08996494293212891 sec

Code:

import os
import numpy as np
import time
import matplotlib.pyplot as plt

# take a slice of the data
def slice_data(roi):
    dic = {}
    data = np.zeros((512,512,256))
    dic['data'] = np.squeeze( data[roi[0]:roi[1]+1, roi[2]:roi[3]+1, roi[4]:roi[5]+1] )
    return dic


# save slices of the data
def save_slices(roi, save=False):
    var = 'data'
    for i in range(0, 6):
        # iterate to simulate a time series of data
        a = slice_data(roi)[var]
        var_dir = 'save_test/'
        if not os.path.exists(var_dir): os.makedirs(var_dir)
        file = var_dir + '{0:04d}{1}'.format(i,'.npy')

        if save is True:
            np.save(file, a)


## define slices
roix=[256, 256, 0, 512, 0, 256] # yz plane slice
roiy=[0, 512, 256, 256, 0, 256] # xz plane slice
roiz=[0, 512, 0, 512, 128, 128] # xy plane slice

## Calculate slices and do not save the results
dtx = []
dty = []
dtz = []
for i in range(100):
    time0 = time.time()
    save_slices(roix)
    time1 = time.time()
    dtx.append(time1-time0)

    time0 = time.time()
    save_slices(roiy)
    time1 = time.time()
    dty.append(time1-time0)


    time0 = time.time()
    save_slices(roiz)
    time1 = time.time()
    dtz.append(time1-time0)

plt.figure(1)
plt.plot(dtx)
plt.plot(dty)
plt.plot(dtz)
plt.title('time to run code without saving data')

print('mean time x-slice: {} sec'.format(np.mean(dtx)))
print('mean time y-slice: {} sec'.format(np.mean(dty)))
print('mean time z-slice: {} sec'.format(np.mean(dtz)))


## Calculate slices and do save the results
dtx = []
dty = []
dtz = []
for i in range(100):
    time0 = time.time()
    save_slices(roix, save=True)
    time1 = time.time()
    dtx.append(time1-time0)

    time0 = time.time()
    save_slices(roiy, save=True)
    time1 = time.time()
    dty.append(time1-time0)


    time0 = time.time()
    save_slices(roiz, save=True)
    time1 = time.time()
    dtz.append(time1-time0)

plt.figure(2)
plt.plot(dtx)
plt.plot(dty)
plt.plot(dtz)
plt.title('time to run code and save data')

print('mean time x-slice: {} sec'.format(np.mean(dtx)))
print('mean time y-slice: {} sec'.format(np.mean(dty)))
print('mean time z-slice: {} sec'.format(np.mean(dtz)))
Petros
  • Updated my answer with a more detailed explanation. – Agost Biro Jun 06 '19 at 14:11
  • Can you time a slice followed by a copy? A slice makes a `view`, which is a new array with shared data buffer (so no iteration or copying of data). `np.save` will write the data of a copy to the file, not the whole data of the source array. – hpaulj Jun 06 '19 at 16:27

2 Answers


The reason is that NumPy stores data in row-major order by default. If you change

data = np.zeros((512,512,256))

to

# order F means column major
data = np.zeros((512,512,256), order='F')

you will see that saving X-slices takes the longest.

If you're going to save multiple slices of the XY plane (varying the Z coordinate), you will get better performance by transposing the array and making a copy of it to force a new memory layout. This makes sure the memory layout fits your access pattern, which results in faster reading (and saving); see the sketch just below. A more detailed explanation follows.
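
For example, a minimal sketch of this idea (the data array here is a stand-in for your volume, not the question's exact code):

import numpy as np

data = np.zeros((512, 512, 256))           # row-major (C order) by default

# An xy slice (fixed z) is not contiguous in C order:
print(data[:, :, 128].flags.c_contiguous)  # False

# Move the z axis to the front, then copy to force a new C-contiguous layout:
data_z_first = np.ascontiguousarray(data.transpose(2, 0, 1))

# Now each xy slice is one contiguous block of memory:
print(data_z_first[128].flags.c_contiguous)  # True

The one-time cost of the transpose and copy pays off when you save many xy slices from the same volume.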


Let's take as an example the following matrix (from the Numpy glossary):

m = [[1, 2, 3],
     [4, 5, 6]]

If this is represented in memory in row-major order (C order in NumPy lingo), it is laid out like this:

[1, 2, 3, 4, 5, 6]

If the matrix is represented in memory in column-major order (F, for Fortran order), it is laid out like this:

[1, 4, 2, 5, 3, 6]

Now if you index into this array with m[:,2] you get [3, 6], and with m[1,:], you get [4, 5, 6]. If you look back at the memory layouts, you will see that the values [3, 6] will be contiguous in the column major representation and [4, 5, 6] will be contiguous in the row major representation.

When reading lots of elements from an array (as is the case when saving one), it is much better performance-wise to read the values contiguously, because that allows you to take advantage of the CPU cache, which is 1-2 orders of magnitude faster than reading from main memory.
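
You can check these layouts directly in NumPy (a small self-contained demo):

import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Flatten in row-major (C) and column-major (F) order to see both layouts:
print(m.ravel(order='C'))          # [1 2 3 4 5 6]
print(m.ravel(order='F'))          # [1 4 2 5 3 6]

# In the default C order, a row is contiguous in memory but a column is not:
print(m[1, :].flags.c_contiguous)  # True
print(m[:, 2].flags.c_contiguous)  # False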

Agost Biro
  • Many thanks for the reply. It was really helpful to understand the issue with the memory. Is there any way write slices of data to files so that I avoid this issue? Or is it just impossible to avoid the slower speed? – Petros Jun 07 '19 at 10:29
  • In the use case described in your question, you will see better performance by transposing the array (to make the last axis the first one) when you're taking slices of the XY plane. In general, when using a row-major memory layout, you want to make sure that you're varying coordinates in the first dimension for best reading (and thus saving) performance. Btw, I think my answer better answers your question and was posted prior to the other response. Please consider accepting mine instead. – Agost Biro Jun 07 '19 at 14:47

Short answer

Only the roix slice is C-contiguous, so its transfer over the bus from memory to the CPU is faster than for non-contiguous data (the bus moves data in chunks and caches them).

You can get a small improvement (about 5% for roiz and 40% for roiy) by making the slice C-contiguous before saving: np.save(file, np.asarray(a, order='C'))
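
Applied to the save_slices function from the question, the save branch would look something like this (just a sketch; np.ascontiguousarray(a) is equivalent here to np.asarray(a, order='C')):

        if save is True:
            # Copy the slice into one C-contiguous buffer before writing,
            # so np.save streams a single contiguous block instead of
            # gathering strided data.
            np.save(file, np.ascontiguousarray(a))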

More explanation

Profiling

You should use timeit to measure performance instead of custom timing code.

I have done this for you as an example.

In one cell we have:

import os
import numpy as np
import time
import matplotlib.pyplot as plt

# take a slice of the data
def slice_data(roi):
    dic = {}
    data = np.zeros((512,512,256))
    dic['data'] = np.squeeze( data[roi[0]:roi[1]+1, roi[2]:roi[3]+1, roi[4]:roi[5]+1] )
    return dic


# save slices of the data
def save_slices(roi, save=False):
    var = 'data'
    for i in range(0, 6):
        # iterate to simulate a time series of data
        a = slice_data(roi)[var]
        var_dir = 'save_test/'
        if not os.path.exists(var_dir): os.makedirs(var_dir)
        file = var_dir + '{0:04d}{1}'.format(i,'.npy')

        if save is True:
            np.save(file, a)


## define slices
roix=[256, 256, 0, 512, 0, 256] # yz plane slice
roiy=[0, 512, 256, 256, 0, 256] # xz plane slice
roiz=[0, 512, 0, 512, 128, 128] # xy plane slice

and in the other cells:

%%timeit -n 100
save_slices(roix) # 19.8 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
save_slices(roiy) # 20.5 ms ± 948 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
save_slices(roiz) # 20 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

With the save

%%timeit -n 10 -r 3
save_slices(roix, True) # 32.7 ms ± 2.31 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

%%timeit -n 10 -r 3
save_slices(roiy, True) # 101 ms ± 2.61 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

%%timeit -n 10 -r 3
save_slices(roiz, True) # 1.9 s ± 21.1 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

So, as you already noticed, without saving, the performance is the same! Let's dig into the np.save() method.

The np.save method

np.save takes care of the I/O streaming and calls the write_array method, which is really fast for C-contiguous arrays (fast memory access).

Let's verify this hypothesis:

np.squeeze( np.zeros((512,512,256))[roix[0]:roix[1]+1, roix[2]:roix[3]+1, roix[4]:roix[5]+1] ).flags.c_contiguous # returns True
np.squeeze( np.zeros((512,512,256))[roiy[0]:roiy[1]+1, roiy[2]:roiy[3]+1, roiy[4]:roiy[5]+1] ).flags.c_contiguous # returns False
np.squeeze( np.zeros((512,512,256))[roiz[0]:roiz[1]+1, roiz[2]:roiz[3]+1, roiz[4]:roiz[5]+1] ).flags.c_contiguous # returns False

So this might explain the difference between roix and roiy/roiz.

Potential explanation for the difference between roiy and roiz: the data transfer slows down the program

After that, I can only make assumptions: the roiz slice seems to be a lot more fragmented in memory than roiy, which costs the write_array method a lot of time. The strides check below makes this concrete.
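
To see how fragmented each slice is, you can compare the strides (the byte distances between consecutive elements along each axis) of the three squeezed slices; this quick check uses the same shapes as above, with x, y, z as illustrative variable names:

import numpy as np

data = np.zeros((512, 512, 256))  # C order, float64: strides are (1048576, 2048, 8)

x = np.squeeze(data[256:257, 0:513, 0:257])  # roix: yz plane
y = np.squeeze(data[0:513, 256:257, 0:257])  # roiy: xz plane
z = np.squeeze(data[0:513, 0:513, 128:129])  # roiz: xy plane

print(x.strides)  # (2048, 8): one fully contiguous block
print(y.strides)  # (1048576, 8): contiguous runs of 256 elements, 1 MiB apart
print(z.strides)  # (1048576, 2048): no two elements are adjacent in memory

The roiz slice never reads two adjacent bytes in a row, so it touches the most cache lines per element written, which is consistent with it being by far the slowest to save.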

I cannot test this myself right now, but it could be verified using the perf command on Linux (to see, for instance, how often the buses are used and the number of cache misses). If I had to make a wild guess, I would say the number of cache misses is quite high because the data are not contiguous, so the transfer of the data from RAM to the CPU really slows down the process.

Other ways to deal with the storage

I have not tried them, but there is a nice question with some useful answers: best way to preserve numpy arrays on disk

BlueSheepToken
  • Many thanks for the reply. It was really helpful for understanding the memory issue. If I understood correctly, there is unfortunately no way to avoid the slower file write, because in order to write the file, np.save copies the data in memory to make it contiguous, and that causes the slowdown. Is there any other way to write the data to avoid this issue? – Petros Jun 07 '19 at 10:27
  • I have updated my answer to give you a really small improvement... You can have a look at other ways to store it; here is a link to a nice question: https://stackoverflow.com/questions/9619199/best-way-to-preserve-numpy-arrays-on-disk. You might want to try the other storage options :) – BlueSheepToken Jun 07 '19 at 11:09
  • Many thanks for all the help! I really appreciate it! – Petros Jun 07 '19 at 12:50
  • @Petros My pleasure! I had a good time challenging myself to understand what was going on under the hood. I will probably profile the cache misses with perf to be sure about my last assumption! :) – BlueSheepToken Jun 07 '19 at 14:36
  • You can have a big improvement, by changing the memory layout to fit your access pattern. Please see the other answer. – Agost Biro Jun 07 '19 at 14:50