
I saw in another SO thread that it's possible to create a single-element view of an array `arr` with `arr[index:index+1]`. This is useful to me since I need to set several values of a (possibly large, ~100k entries) array repeatedly. But before I just used that approach, I wanted to make sure the extra work of creating the view doesn't cost too much time. Surprisingly, I found that if you access an index at least ~10 times, you're already better off using the view.

[Plot: Comparison of time taken by accessing a numpy array repeatedly with an index vs. a single-element view]
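For context, a quick sketch of my own (not from the linked thread) confirming that such a slice really is a view sharing the parent array's buffer, not a copy:

```python
import numpy as np

arr = np.arange(5)
view = arr[2:3]          # single-element view, not a copy

view[0] = 99             # writing through the view...
print(arr)               # ...modifies the original array: [ 0  1 99  3  4]
print(view.base is arr)  # True: the slice shares arr's buffer
```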

The data for this plot was created by timing the two approaches (in Python 3.10):

#!/bin/python3
# https://gist.github.com/SimonLammer/7f27fd641938b4a8854b55a3851921db

from datetime import datetime, timedelta
import numpy as np
import timeit

np.set_printoptions(linewidth=np.inf, formatter={'float': lambda x: format(x, '1.5E')})

def indexed(arr, indices, num_indices, accesses):
    s = 0
    for index in indices[:num_indices]:
        for _ in range(accesses):
            s += arr[index]
    return s

def viewed(arr, indices, num_indices, accesses):
    s = 0
    for index in indices[:num_indices]:
        v = arr[index:index+1]
        for _ in range(accesses):
            s += v[0]
    return s

N = 11_000 # Setting this higher doesn't seem to have significant effect
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

options = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946]
for num_indices in options:
    for accesses in options:
        print(f"{num_indices=}, {accesses=}")
        for func in ['indexed', 'viewed']:
            t = np.zeros(5)
            end = datetime.now() + timedelta(seconds=2.5)
            i = 0
            while i < 5 or datetime.now() < end:
                t += timeit.repeat(f'{func}(arr, indices, num_indices, accesses)', number=1, globals=globals())
                i += 1
            t /= i
            print(f"  {func.rjust(7)}:", t, f"({i} runs)")

These observations are very counterintuitive to me. Why is viewed faster than indexed (for more than 10 accesses per index)?


Edit 1:


Edit 2:

I could replicate Jérôme Richard's findings. The culprit is the index datatype (python int vs. numpy int):

>>> import timeit
>>> timeit.timeit('arr[i]', setup='import numpy as np; arr = np.random.randint(0, 1000, 1000); i = np.random.randint(0, len(arr), 1)[0]', number=20000000)
1.618339812999693
>>> timeit.timeit('arr[i]', setup='import numpy as np; arr = np.random.randint(0, 1000, 1000); i = np.random.randint(0, len(arr), 1)[0]; i = int(i)', number=20000000)
1.2747555710002416
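To make explicit what the two setups above differ in (a quick sketch; the exact scalar type can vary by platform, e.g. `np.int32` on some Windows builds):

```python
import numpy as np

arr = np.random.randint(0, 1000, 1000)
i = np.random.randint(0, len(arr), 1)[0]

print(type(i))       # a NumPy scalar, e.g. <class 'numpy.int64'>
print(type(int(i)))  # <class 'int'> (builtin PyLong)

# Both index the array identically; only the lookup speed differs.
assert arr[i] == arr[int(i)]
```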
GitProphet
  • With just `options = [1000]`, I get ~0.20 vs ~0.16. But if I then insert `index = 342` in `indexed` inside the `for index` loop before the `for _` loop, then `indexed` becomes ~0.16 as well. – Kelly Bundy Jul 30 '22 at 16:35
  • (I also tried equally inserting `index = 342` into `viewed`. That didn't affect its time.) – Kelly Bundy Jul 30 '22 at 16:51

2 Answers


Since num_indices has no significant impact on the observed performance, we can simplify the problem by discarding this parameter (ie. setting it to 1). Since only large accesses values matter, we can also simplify the problem by considering only a large value, like 10946 for example. The use of index can be simplified as well without affecting the benchmark, and the same goes for the return statement. The simplified question is now why we get this (reproduced on CPython 3.10.5):

import numpy as np

def indexed(arr, index):
    s = 0
    for _ in range(10946): s += arr[index]

def viewed(arr, index):
    s = 0
    v = arr[index:index+1]
    for _ in range(10946): s += v[0]

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

# mean ± std. dev. of 7 runs, 1000 loops each
%timeit indexed(arr, indices[0])       # 1.24 ms ± 22.3 µs per loop
%timeit viewed(arr, indices[0])        # 0.99 ms ± 4.34 µs per loop

Now, the source of the slowdown is pretty limited. It comes down to arr[index] versus v[0]. It is also important to note that arr and v are basically of the same type, while index and 0 are not: index is of type np.int64 while 0 is a PyLong object. The thing is that Numpy item types are much slower than builtin ones. Put shortly, CPython functions are called by Numpy in both cases, but PyLong objects can be processed faster than generic PyNumber-based datatypes (by CPython itself). Numpy's type checks are also ordered, so the types checked first can be dispatched faster. For more information, see the section "Deeper analysis & Discussion".
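A minimal sketch isolating just this effect, stripped of the surrounding loops (timing numbers are machine-dependent and only illustrative):

```python
import timeit
import numpy as np

arr = np.zeros(10)
py_idx = 0            # builtin int (PyLong)
np_idx = np.int64(0)  # NumPy scalar holding the same value

t_py = timeit.timeit(lambda: arr[py_idx], number=1_000_000)
t_np = timeit.timeit(lambda: arr[np_idx], number=1_000_000)
print(f"PyLong index:   {t_py:.3f} s")
print(f"np.int64 index: {t_np:.3f} s")  # typically the slower of the two
```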

To fix the problem, you can just convert the Numpy type to a builtin one:

import numpy as np

def indexed(arr, index):
    s = 0
    nativeIndex = int(index)  # <------------------------------
    for _ in range(10946): s += arr[nativeIndex]

def viewed(arr, index):
    s = 0
    v = arr[index:index+1]
    for _ in range(10946): s += v[0]

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

# mean ± std. dev. of 7 runs, 1000 loops each
%timeit indexed(arr, indices[0])       # 981 µs ± 4.6 µs per loop
%timeit viewed(arr, indices[0])        # 989 µs ± 5.3 µs per loop
# The difference is smaller than the sum of the standard deviations 
# so the gap is clearly not statistically significant anymore.

Besides, note that nearly all the time of the two functions is pure overhead. Numpy is not designed for scalar accesses; it is optimized for vectorized ones. A naive expression like v[0] causes a LOT of work to be done by the processor: the expression needs to be interpreted, a new reference-counted object needs to be allocated, and several (internal) Numpy functions need to be called, each with overheads (wrapping, dynamic type checking, internal iterator configuration). Low-level profilers report dozens of function calls and 250-300 cycles wasted for something that should take no more than 1 cycle of throughput on a modern x86 processor.
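To illustrate that last point: replacing the scalar loop with one vectorized fancy-indexing call amortizes all that per-access overhead (a sketch; the speedup factor varies by machine):

```python
import numpy as np

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

# Scalar loop: a full Numpy call (and a new scalar object) per access
s_loop = 0
for i in indices:
    s_loop += arr[i]

# Vectorized: one fancy-indexing call plus one reduction
s_vec = arr[indices].sum()

assert s_loop == s_vec  # same result, far fewer interpreter round-trips
```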


Deeper analysis & Discussion

When the expression arr[nativeIndex] (or v[0]) is evaluated, the CPython interpreter ends up calling Numpy functions, in particular the generic array_subscript. This function calls prepare_index, which does some type checks on the operand. These checks are slower with non-builtin types. One reason is that PyArray_PyIntAsIntp calls CPython type-checking functions, and the code for checking the PyLong type is faster than the much more generic alternative that "covers everything" (including np.int64, to quote a comment in the code). Additionally, there is a cascade of type checks, and types that cause early breaks tend to result in a faster checking stage (because of the cumulative cost of all the CPython function calls). Because of that, Numpy developers tend to put the more common cases first. That being said, it is not so simple: the cost of checking each type is not the same (so putting a common but slow-to-check type first can slow down all the other cases), and there are sometimes dependencies in the code. Also note that the CPython interface prevents Numpy from doing further optimizations (that CPython does for builtins) and makes the more generic path a bit slower because of this. In fact, this is what makes builtin types faster (though it is generally not the only reason).
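The generic path described above is observable from Python via the `__index__` protocol (which `PyNumber_Index` implements at the C level): builtin ints pass the fast PyLong check directly, while NumPy scalars have to be converted through `__index__`. A quick sketch:

```python
import operator
import numpy as np

i = np.int64(7)

# np.int64 is not a PyLong subclass, so the fast PyLong check fails...
print(isinstance(i, int))  # False

# ...and index handling falls back to the __index__ protocol,
# which converts the NumPy scalar into a builtin int:
j = operator.index(i)
print(j, type(j))  # 7 <class 'int'>
```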


Related post: Why is np.sum(range(N)) very slow?

Jérôme Richard
  • I can reproduce the speedup, but I'm not convinced that it's because *"the interpreter can use builtin functions on builtin types"*. What builtin functions does the interpreter use here on the index object? Does the interpreter do anything with the index object other than incrementing its reference counter and passing it on to the compiled NumPy code (which then actually does something with it)? – Kelly Bundy Aug 01 '22 at 01:35
  • Well, outside the Numpy code, in this case, yes. But the thing is that CPython interpreter functions are called from Numpy internal functions, for example when checking parameters, and this is typically where builtins are faster than a custom type requiring more operations. Additionally, operations like additions, comparisons, etc. are slower on Numpy types, and they can be used by either Numpy or the interpreter (but I do not think this is the case here). I will add a note about this. I wanted to avoid describing the weirdness of Numpy internals (especially the checking part). – Jérôme Richard Aug 01 '22 at 03:15
  • Thanks for the update. That's more like what I suspected to be happening. Those functions (array_subscript, prepare_index, PyArray_PyIntAsIntp) are NumPy functions, and if they wanted to, they could check for NumPy ints first and then those would be faster instead, right? (I don't consider the functions called from there "interpreter"... for me, that's non-code data handling, and if we consider all data handling "interpretation", then everything is interpretation.) – Kelly Bundy Aug 01 '22 at 04:22
  • Thank you very much. Now the (python) world makes sense again :) – GitProphet Aug 01 '22 at 08:25
  • @KellyBundy It is not just about the order: they need a reference to be decremented (`Py_DECREF`), which is not needed for PyLong based on the code, they need to exclude PyLong from the list, and `PyNumber_Index` is a CPython function like the others. These functions are intrinsically part of CPython, which is an interpreter (PyPy does not have this, for example). Dynamic type checking, object conversion and the related runtime error system are strongly related to interpreters. That being said, it is not part of the fetch/decode logic, so I partially agree. Let's call it just "CPython". – Jérôme Richard Aug 01 '22 at 11:45
  • @KellyBundy `PyNumber_Index`+`Py_DECREF` are slower than the PyLong check/extraction, and Numpy cannot do anything about that because of the CPython interface here. CPython can make internal assumptions about its own code though, and this is why builtins tend to be fast. Numpy cannot do that safely because it would break the interface. Relying on a specific implementation of CPython is not a good idea, especially since it can change (and has already recently changed, actually). – Jérôme Richard Aug 01 '22 at 11:47

Update: I can't replicate the timings of this answer anymore. Maybe I have done something in a setup step that changed these results; or they were just coincidence.

>>> arr = np.random.randint(0, 1000, 1000)
>>> i = 342
>>> def a3(i): return arr[i]
...
>>> def b3(i): return arr[342]
...
>>> def c3(i): return arr[0]
...
>>> t = timeit.repeat('a3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.449311104006483, 17.405843814995023, 17.91914719599299, 18.123263651999878, 18.04744581299019] 17.789002315996914 17.91914719599299
>>> t = timeit.repeat('b3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.55685576199903, 18.099313585989876, 18.032570399998804, 18.153590378991794, 17.628647994992207] 17.894195624394342 18.032570399998804
>>> t = timeit.repeat('c3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.762766532003297, 17.826293045000057, 17.821444382003392, 17.618322997994255, 17.488862683996558] 17.703537928199513 17.762766532003297

The timing difference seems to be caused by loading a variable vs. loading a constant.

import numpy as np
import dis
arr = np.random.randint(0, 1000, 1000)

def a3(i):
    return arr[i]
def b3(i):
    return arr[342]
def c3(i):
    return arr[0]

The difference in these functions is just the way of indexing the array with i, 342 or 0.

>>> dis.dis(a3)
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_FAST                0 (i)
              4 BINARY_SUBSCR
              6 RETURN_VALUE
>>> dis.dis(b3)                                                                   
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_CONST               1 (342)
              4 BINARY_SUBSCR
              6 RETURN_VALUE
>>> dis.dis(c3)                                                                   
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_CONST               1 (0)
              4 BINARY_SUBSCR
              6 RETURN_VALUE

The variable index is (~8%) slower than a constant index, and a constant index 0 is (~5%) faster still. Accessing the array at index 0 (c3) is (~13%) faster than the variable index (a3).

>>> t = timeit.repeat('a3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.4897515250049764, 1.507482559987693, 1.5573357169923838, 1.581711255988921, 1.588776800010237] 1.5450115715968422 1.5573357169923838
>>> t = timeit.repeat('b3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.4514476449985523, 1.427873961001751, 1.4268056689907098, 1.4114146630017785, 1.442651974997716] 1.4320387825981016 1.427873961001751
>>> t = timeit.repeat('c3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.357518576012808, 1.3500928360008402, 1.3615708220022498, 1.376022889991873, 1.3813936790102161] 1.3653197606035974 1.3615708220022498

Thanks to u/jtclimb https://www.reddit.com/r/Numpy/comments/wb4p12/comment/ii7q53s/?utm_source=share&utm_medium=web2x&context=3


Edit 1: Using the setup parameter of timeit.repeat refutes this hypothesis.

>>> t=timeit.repeat('arr[i]', setup='import numpy as np; arr = np.random.randint(0,10000,1000000); i = 342', number=10000000); print(np.around(t, 5), np.mean(t), np.median(t))
[0.7697  0.76627 0.77007 0.76424 0.76788] 0.7676320286031114 0.7678760859998874
>>> t=timeit.repeat('arr[0]', setup='import numpy as np; arr = np.random.randint(0,10000,1000000); i = 342', number=10000000); print(np.around(t, 5), np.mean(t), np.median(t))
[0.76836 0.76629 0.76794 0.76619 0.7682 ] 0.7673966443951941 0.7679443680099212
GitProphet
  • I'm not convinced. This might instead be showing that it took differently long to look up the three functions. Better measure just the array access expressions (using the `setup` parameter to initialize `arr` and `i` so they're locals). Also, what value did you use for `i`? – Kelly Bundy Jul 30 '22 at 13:36
  • I can reproduce your question's original time difference, but not your answer's time differences. – Kelly Bundy Jul 30 '22 at 13:59
  • Weirdly enough, I can't replicate the timings anymore either. – GitProphet Jul 30 '22 at 15:18
  • Another indication that this isn't it is what I commented under the question now. If this answer were right, then `indexed` with its `arr[index]` should still be slower instead of matching the ~0.16. – Kelly Bundy Jul 30 '22 at 16:38