The accessing time of a numpy array is impacted much more by the last index compared to the second last

Question

This is a follow up to this answer to my previous question Fastest approach to read thousands of images into one big numpy array.

In chapter 2.3 "Memory allocation of the ndarray", Travis Oliphant writes the following regarding how indexes are accessed in memory for C-ordered numpy arrays.

...to move through computer memory sequentially, the last index is incremented first, followed by the second-to-last index and so forth.

This can be confirmed by benchmarking the accessing time of 2-D arrays either along the two first or the two last indexes (for my purposes, this is a simulation of loading 500 images of size 512x512 pixels):

import numpy as np

N = 512
n = 500
a = np.random.randint(0,255,(N,N))

def last_and_second_last():
    '''Store along the two last indexes'''
    imgs = np.empty((n,N,N), dtype='uint16')
    for num in range(n):
        imgs[num,:,:] = a
    return imgs

def second_and_third_last():
    '''Store along the two first indexes'''
    imgs = np.empty((N,N,n), dtype='uint16')
    for num in range(n):
        imgs[:,:,num] = a
    return imgs

Benchmark

In [2]: %timeit last_and_second_last()
136 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit second_and_third_last()
1.56 s ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So far so good. However, when I load arrays along the last and third last dimension, this is almost as fast as loading them into the two last dimensions.

def last_and_third_last():
    '''Store along the last and first indexes'''
    imgs = np.empty((N,n,N), dtype='uint16')
    for num in range(n):    
        imgs[:,num,:] = a
    return imgs

Benchmark

In [4]: %timeit last_and_third_last()
149 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why is it that last_and_third_last() is so my closer in speed to last_and_second_last() compared to second_and third_last()?
What's a good way to visualize why the last index matters much more than the second last index in regards to the accessing speed?

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

I'll try to illustrate the indexing, without getting into details of processor caching etc.

Lets make a small 3d array with distinctive element values:

In [473]: X = np.mgrid[100:300:100,10:30:10,1:4:1].sum(axis=0)
In [474]: X
Out[474]: 
array([[[111, 112, 113],
        [121, 122, 123]],

       [[211, 212, 213],
        [221, 222, 223]]])
In [475]: X.shape
Out[475]: (2, 2, 3)

ravel views it as a 1d array, and shows us how the values are laid out in memory. (This, incidentally,is with the default C ordering)

In [476]: X.ravel()
Out[476]: array([111, 112, 113, 121, 122, 123, 211, 212, 213, 221, 222, 223])

When I index on the 1st dimension I get 2*3 values, a contiguous block of the above list:

In [477]: X[0,:,:].ravel()
Out[477]: array([111, 112, 113, 121, 122, 123])

Indexing instead on the last gives 4 values, selected from across the array - I've added .. to highlight that

In [478]: X[:,:,0].ravel()
Out[478]: array([111,.. 121,.. 211,.. 221])

Indexing on the middle gives me 2 contiguous subblocks, i.e. 2 rows of X.

In [479]: X[:,0,:].ravel()
Out[479]: array([111, 112, 113,.. 211, 212, 213])

With the strides and shape calculations numpy can access any one element in X in (about) the same time. And in the X[:,:,i] case that's what it has to do. The 4 values are 'scattered' across the databuffer.

But if it can access contiguous blocks, such as in X[i,:,:], it can delegate more of the action to low level compiled and processor code. With X[:,i,:] those blocks aren't quite as big, but may still be big enough to make a big difference.

In your test case, [n,:,:] iterates 500 times on 512*512 element blocks.

[:,n,:] has to divide that access into 512 blocks of 512 each.

[:,:,n] has to do 500 x 512 x 512 individual accesses.

I wonder if working with uint16 exaggerates the effect. In another question we just showed that calculation with float16 is much slower (up to 10x) because the processor (and compiler) is tuned to work with 32 and 64 bit numbers. If the processor is tuned to moving blocks of 64bit numbers around, then moving an isolated 16 bit number could require a lot of extra processing. It would be like doing a copy-n-paste from a document word-by-word, when copying line-by-line requires fewer key strokes per copy.

The exact details are buried in the processor, operating system and compiler, as well as numpy code, but hopefully this gives a feel for why your middle case falls much closer to the optimum than to the worst case.

On testing - setting imgs to a.dtype slows things down a bit for all cases. So the 'uint16' isn't causing any special problems.

Why does `numpy.einsum` work faster with `float32` than `float16` or `uint16`?

Aaron · Answer 2 · 2017-05-22T18:45:00.210

4

Numpy's arrays are built on c and c++ so we get to think about stuff like cache lines when we're pushing it to it's absolute limits. In last_and_second_last(): and last_and_third_last(): you are reading more than a single byte along the last axis, so an entire cache line is used at a time (16 actually as your last axis is 1024 bytes long). In second_and_third_last(), an entire cache line must be copied to read (or write) a single value in the last axis. Modern c compilers (and others: fortran, etc.) will take nested loops that access memory in the wrong order like this and silently re-order them to optimize cache usage, but python cannot do this.

Example:

Assume you have a very basic processor where the cache is 4 words wide.
Your array is 4x4x4 arr = np.arange(64).reshape([4,4,4])
If you want to access arr[i,j,:] you can load all of them in the cache at once (ex: [0,1,2,3])
If you want to access arr[i,:,k]
- The processor will first load arr[i,0,:] and read [k] from that array of 4
- Then it will load arr[i,1,:] and read [k] from that array of 4
- etc...

edited May 22 '17 at 18:45

answered May 22 '17 at 15:03

Aaron

10,133
1
24
40

Basically as long as your last axis is used, and is at least 64 bytes, you ought to see peak performance. Only using a single value from the last axis is basically a worst case scenario memory-wise. – Aaron May 22 '17 at 15:10
If you tell numpy to use fortran order, you'd find a similar effect only with the first axis instead of the last. – Aaron May 22 '17 at 15:14
I don't think that these loops are reorderable by the compiler in numpy - the problem is that the strides of the loop are themselves only known at runtime. Instead the reordering happens within numpy itself, I think within the `NpyIter` mechanism. – Eric May 22 '17 at 18:38
@Eric I thought I made it clear that python (excluding things like pypy and numba) doesn't compile and therefore doesn't make those automatic optimizations. (therefore the programmer must be cognizant of memory continuity). I've clarified a bit more.. – Aaron May 22 '17 at 18:44

Eric · Answer 3 · 2017-05-22T18:36:06.423

2

The key point here is not that it is the last axis, but that it is the axis ax where imgs.strides[ax] == imgs.dtype.itemsize — that is, the memory is contiguous along that axis. The default behaviour is to apply this to the last axis, but don't assume that - you'll see the opposite behaviour with imgs.T (as this creates a view, reversing the strides array)

When NumPy detects that axes are contiguous, it uses a memcpy on the entire dimension, which compilers optimize significantly. In other cases, NumPy is only able to memcpy one element at a time

edited May 22 '17 at 18:36

answered May 22 '17 at 16:14

Eric

95,302
53
242
374

Your first paragraph is about `order C` vs `order F`. The `imgs.T` reference may be confusing if you don't already know that it is a view produced by changing the `order` (and strides and shape). – hpaulj May 22 '17 at 16:35
Right, but I was trying to generalize to arrays that are neither "order C" nor "order F" – Eric May 22 '17 at 18:35

score 0 · Answer 4 · answered May 25 '17 at 03:58

A key for me was to understand that rows in a C-ordered numpy array are appended to each other to form a continuous block/buffer in memory, similar to the layout @hpaulj showed with .ravel(). Reading more about how arrays work in C further helped with that understanding, especially these three resources:

Since it is very efficient to retrieve elements laid out continuously in memory, the costly parts of accessing an array becomes the lookups, which is when parts of the continues memory chunks are skipped over to continue reading at another memory location. Aaron's answer outlines reasons to why this is an expensive operation.

As @hpaulj pointed out, the [n,:,:] approach does the least amount of lookups and the [:,:,n] approach does by far the most, which explains why this method lags significantly behind the other two:

[n,:,:] = 500 lookups, each retrieving 512*512 (262144) sequential elements.
[:,n,:] = 500*512 (256000) lookups, each retrieving 512 sequential elements.
[:,:,n] = 500*512*512 (131072000) lookups, each retrieving 1 element.

The accessing time of a numpy array is impacted much more by the last index compared to the second last

4 Answers4

Linked