
Consider the following simple test:

import numpy as np
from timeit import timeit

a = np.random.randint(0,2,1000000,bool)

Let us find the index of the first True:

timeit(lambda:a.argmax(), number=1000)
# 0.000451055821031332

This is reasonably fast because numpy short-circuits.
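A quick way to see the short-circuiting (a sketch; exact timings vary by machine) is to compare an array whose first element is True against an all-False array of the same size:

```python
import numpy as np
from timeit import timeit

early = np.zeros(1000000, dtype=bool)
early[0] = True                        # first element True: scan can stop at once
late = np.zeros(1000000, dtype=bool)   # all False: full scan required

t_early = timeit(lambda: early.argmax(), number=1000)
t_late = timeit(lambda: late.argmax(), number=1000)
# On a typical machine t_early is orders of magnitude smaller than t_late.
```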

It also works on contiguous slices:

timeit(lambda:a[1:-1].argmax(), number=1000)
# 0.0006490410305559635

But not, it seems, on non-contiguous ones. I was mainly interested in finding the last True:

timeit(lambda:a[::-1].argmax(), number=1000)
# 0.3737605109345168

UPDATE: My assumption that the observed slowdown was due to not short circuiting is inaccurate (thanks @Victor Ruiz). Indeed, in the worst-case scenario of an all False array

b = np.zeros_like(a)
timeit(lambda:b.argmax(), number=1000)
# 0.04321779008023441

we are still an order of magnitude faster than in the non-contiguous case. I'm ready to accept Victor's explanation that the actual culprit is a copy being made (timings of forcing a copy with .copy() are suggestive). Afterwards it doesn't really matter anymore whether short-circuiting happens or not.
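As a rough check of the copy hypothesis, one can time the forced copy alone against argmax on the reversed view (illustrative only; numbers depend on the machine):

```python
import numpy as np
from timeit import timeit

a = np.random.randint(0, 2, 1000000, bool)

t_copy = timeit(lambda: a[::-1].copy(), number=100)      # just the forced copy
t_argmax = timeit(lambda: a[::-1].argmax(), number=100)  # copy + argmax
# If the copy dominates, the two timings should be of similar magnitude.
```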

But other step sizes != 1 yield similar behavior.

timeit(lambda:a[::2].argmax(), number=1000)
# 0.19192566303536296

Question: Why does numpy not short-circuit (UPDATE: avoid the copy) in the last two examples?

And, more importantly: Is there a workaround, i.e. some way to force numpy to short-circuit (UPDATE: avoid the copy) also on non-contiguous arrays?

Paul Panzer
  • I vaguely recall looking at this short-circuiting for an SO question quite some time ago. It applies to bool and to `nan` in floats. I think it's implemented at a very low level in the C code, so Victor's idea that your strides create a copy makes some sense. – hpaulj Aug 04 '19 at 16:25
  • Choosing a good test subject for short-circuiting can be tricky. Do you want it to quit early in the iteration, around the middle, or way at the end? Timings vary if it does short-circuit. – hpaulj Aug 04 '19 at 16:26
  • @hpaulj _"So Victor's idea that your strides creates a copy makes some sense."_ The timings certainly support it. – Paul Panzer Aug 04 '19 at 16:56

2 Answers


The problem is related to the memory layout of the array when slicing with strides: a[1:-1] is C-contiguous, but a[::-1] and a[::2] are not:

a = np.random.randint(0,2,1000000,bool)

print(a[1:-1].flags.c_contiguous) # True
print(a[::-1].flags.c_contiguous) # False
print(a[::2].flags.c_contiguous) # False

This explains why np.argmax is slow on a[::-1] and a[::2] (from the documentation on ndarrays):

Several algorithms in NumPy work on arbitrarily strided arrays. However, some algorithms require single-segment arrays. When an irregularly strided array is passed in to such algorithms, a copy is automatically made.

np.argmax(a[::2]) makes a copy of the array first. So if you do timeit(lambda: np.argmax(a[::2]), number=5000), you are timing 5000 copies of the array in addition to the argmax calls.

Execute this and compare the results of these two timing calls:

print(timeit(lambda: np.argmax(a[::2]), number=5000))

b = a[::2].copy()
print(timeit(lambda: np.argmax(b), number=5000))

EDIT: Digging into numpy's C source code, I found the underlying implementation of the argmax function, PyArray_ArgMax, which at some point calls PyArray_ContiguousFromAny to ensure that the given input array is contiguous in memory (C-style).

Then, if the dtype of the array is bool, it delegates to the BOOL_argmax function. Looking at its code, it seems that short-circuiting is always applied.

Summary

  • To avoid copies inside np.argmax, make sure the input array is contiguous in memory.
  • Short-circuiting is always applied when the data type is boolean.
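For example, one way to pay the copy cost once up front is np.ascontiguousarray, which copies only when the input is not already contiguous (a sketch, not part of the original answer):

```python
import numpy as np

a = np.random.randint(0, 2, 1000000, bool)
v = a[::2]                   # non-contiguous view
c = np.ascontiguousarray(v)  # copies here, because v is not contiguous

# Repeated c.argmax() calls now avoid the hidden per-call copy.
```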
Victor Ruiz
  • The copy I'm buying, but what exactly is the point of this reshape thing? – Paul Panzer Aug 04 '19 at 16:54
  • I suspected that was your reasoning. But, alas, it's wrong. The rows are not interleaved. The first row contains the first half a million entries and the second row contains the second half a million entries. What you want to achieve would require reshape to (-1,2) and then argmax along axis zero, but it doesn't make things faster. – Paul Panzer Aug 04 '19 at 19:39
  • Ok. Im removing that from my post – Victor Ruiz Aug 04 '19 at 19:41

I got interested in solving this problem, so I've come up with the following solution, which avoids the slow a[::-1] case caused by the internal ndarray copies made by np.argmax:

I created a small library that implements a function argmax, a wrapper around np.argmax with improved performance when the input argument is a 1D boolean array with stride value -1:

https://github.com/Vykstorm/numpy-bool-argmax-ext

For those cases, it uses a low-level C routine to find the index k of an item with maximum value (True), scanning from the end of the array a to its beginning.
You can then compute argmax(a[::-1]) as len(a)-k-1.
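The backward scan and the len(a)-k-1 mapping can be illustrated in pure Python (an illustration of the idea, not the library's actual code; the -1 convention for an all-False array is hypothetical):

```python
import numpy as np

def last_true_index(a):
    # Backward scan with short-circuit: stops at the first True
    # encountered from the end, without copying the array.
    for k in range(len(a) - 1, -1, -1):
        if a[k]:
            return k
    return -1  # hypothetical convention for an all-False array

a = np.array([False, True, False, True, False])
k = last_true_index(a)                      # k == 3, the last True
assert a[::-1].argmax() == len(a) - k - 1   # mapping back to the reversed view
```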

The low-level method doesn't perform any internal ndarray copies, because it operates on the array a, which is already C-contiguous and aligned in memory. It also short-circuits.


EDIT: I extended the library to improve argmax performance for stride values other than -1 as well (for 1D boolean arrays), with good results for e.g. a[::2] and a[::-3].
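The generalization to arbitrary strides can be sketched in pure Python (illustrative only; the library implements this in C):

```python
import numpy as np

def strided_bool_argmax(a, step):
    # Copy-free argmax over the view a[::step]: walk the underlying
    # array in stride order and short-circuit on the first True.
    n = len(a)
    idx = range(n - 1, -1, step) if step < 0 else range(0, n, step)
    for out_i, i in enumerate(idx):
        if a[i]:
            return out_i
    return 0  # numpy's argmax convention for an all-False array

b = np.array([False, False, True, False, True])
assert strided_bool_argmax(b, 2) == b[::2].argmax()
assert strided_bool_argmax(b, -1) == b[::-1].argmax()
```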

Give it a try.

Victor Ruiz
    Nice! Any reason not to do it for arbitrary stride? And to make a PR at numpy proper? – Paul Panzer Aug 13 '19 at 00:37
  • I thought it wasn't a good solution for general strides because the algorithm won't take advantage of spatial locality (if the stride is big). But I can give it a try. – Victor Ruiz Aug 13 '19 at 07:29
    You still have the benefit of short circuiting (compared to the copy operation). Also according to the gurus [hardware prefetch](https://stackoverflow.com/a/47714514/7207392) is rather good these days. – Paul Panzer Aug 13 '19 at 07:58
  • Updated my answer – Victor Ruiz Aug 13 '19 at 17:18
  • I didn't make a PR because I only just learned the CPython API while writing the library, and I'm not a numpy expert – Victor Ruiz Aug 13 '19 at 17:21