157

numpy.amax() will find the max value in an array, and numpy.amin() does the same for the min value. If I want to find both max and min, I have to call both functions, which requires passing over the (very big) array twice, which seems slow.

Is there a function in the numpy API that finds both max and min with only a single pass through the data?
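
For reference, a minimal sketch of the two-pass baseline in question (the array and its size here are stand-ins for the real data):

import numpy as np

a = np.random.rand(10**8)

# Each call streams over the entire array once,
# so the data is read from memory twice in total.
lo = np.amin(a)
hi = np.amax(a)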

norok2
Stuart Berg
  • 1
    How big is very big? If I get some time, I'll run a few tests comparing a fortran implementation to `amax` and `amin` – mgilson Aug 30 '12 at 16:01
  • 1
    I'll admit "very big" is subjective. In my case, I'm talking about arrays that are a few GB. – Stuart Berg Aug 30 '12 at 17:36
  • that's pretty big. I've coded up an example to calculate it in fortran (even if you don't know fortran, it should be pretty easy to understand the code). It really makes a difference running it from fortran vs. running through numpy. (Presumably, you should be able to get the same performance from C ...) I'm not sure -- I suppose we would need a numpy dev to comment on why my functions perform so much better than theirs ... – mgilson Aug 30 '12 at 17:39
  • 1
    Of course, this is hardly a novel idea. For example, the boost [minmax](http://www.boost.org/doc/libs/1_51_0/libs/algorithm/minmax/) library (C++) provides an implementation of the algorithm I'm looking for. – Stuart Berg Aug 30 '12 at 17:46
  • Yeah, the idea isn't novel. I suppose the question that I'm wondering is what extra work is `np.min` doing to make it so much slower than my naive version? – mgilson Aug 30 '12 at 17:47
  • btw, related question: https://stackoverflow.com/questions/9312756/is-there-a-numpy-max-min-function (though not an exact dupe because this explicitly asks for a single-pass approach that computes both values) – Ken Arnold Jun 14 '17 at 19:33
  • 9
    Not really an answer to the question asked, but probably of interest to people on this thread. Asked NumPy about adding `minmax` to the library in issue ( https://github.com/numpy/numpy/issues/9836 ). – jakirkham Nov 11 '17 at 20:05
  • scipy provides a useful function I use a lot during interactive debugging: scipy.stats.describe(...) whose output includes both minimum and maximum. Don't know if this is fast, however. – Ron Kaminsky Dec 08 '18 at 09:01
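
A quick sketch of the scipy.stats.describe() call mentioned in the last comment (it computes several other statistics besides min and max, so it is unlikely to be faster than two plain reductions):

import numpy as np
from scipy import stats

a = np.random.rand(1000)
desc = stats.describe(a)
lo, hi = desc.minmax  # the (min, max) pair; mean, variance, etc. are also computed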

13 Answers

72

Is there a function in the numpy API that finds both max and min with only a single pass through the data?

No. At the time of this writing, there is no such function. (And yes, if there were such a function, its performance would be significantly better than calling numpy.amin() and numpy.amax() successively on a large array.)

Stuart Berg
  • 2
    If anyone was wondering, as I was, whether an in-place partial sort with just two indices, like so: `x.partition([0, x.size-1])`, would be faster than two separate calls of `x.min()` and `x.max()`, then no, it is not. Two separate calls were always faster. I tested for the following sizes of `x`: [1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8] and measured the relative time of separate to partial: [35.53%, 5.10%, 53.33%, 36.89%, 37.65%, 36.03%, 42.89%]. So in case your data vector has a size of 100 million samples, a separate call of min and max will use only 42.89% of the time compared to partial sorting. – Paloha Oct 18 '21 at 16:27
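
For reference, a sketch of the partition-based variant benchmarked in that comment (note that partition() rearranges the array in place):

import numpy as np

x = np.random.rand(10**6)
expected = (x.min(), x.max())

# After partitioning on the first and last indices, those positions
# hold their sorted values: x[0] is the min, x[-1] is the max.
x.partition([0, x.size - 1])
assert (x[0], x[-1]) == expected
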
35

I don't think that passing over the array twice is a problem. Consider the following pseudo-code:

minval = array[0]
maxval = array[0]
for i in array:
    if i < minval:
       minval = i
    if i > maxval:
       maxval = i

While there is only 1 loop here, there are still 2 checks (instead of 2 loops with 1 check each). Really, the only thing you save is the overhead of 1 loop. If the arrays really are as big as you say, that overhead is small compared to the actual loop's workload. (Note that this is all implemented in C, so the loops are more or less free anyway.)


EDIT Sorry to the 4 of you who upvoted and had faith in me. You definitely can optimize this.

Here's some Fortran code which can be compiled into a Python module via f2py (maybe a Cython guru can come along and compare this with an optimized C version ...):

subroutine minmax1(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  integer i

  amin = a(1)
  amax = a(1)
  do i=2, n
     if(a(i) > amax)then
        amax = a(i)
     elseif(a(i) < amin) then
        amin = a(i)
     endif
  enddo
end subroutine minmax1

subroutine minmax2(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  amin = minval(a)
  amax = maxval(a)
end subroutine minmax2

Compile it via:

f2py -m untitled -c fortran_code.f90

And now we're in a place where we can test it:

import timeit

size = 100000
repeat = 10000

print(timeit.timeit(
    'np.min(a); np.max(a)',
    setup='import numpy as np; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), "# numpy min/max")

print(timeit.timeit(
    'untitled.minmax1(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax1')

print(timeit.timeit(
    'untitled.minmax2(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax2')

The results are a bit staggering for me:

8.61869883537 # numpy min/max
1.60417699814 # minmax1
2.30169081688 # minmax2

I have to say, I don't completely understand it. Comparing just np.min versus minmax1 and minmax2 is still a losing battle, so it's not just a memory issue ...

notes -- Increasing size by a factor of 10**a and decreasing repeat by a factor of 10**a (keeping the problem size constant) does change the performance, but not in a seemingly consistent way, which shows that there is some interplay between memory performance and function-call overhead in Python. Even a simple min implementation in Fortran beats numpy's by a factor of approximately 2 ...

Stuart Berg
mgilson
  • 1
    Indeed, it's still O(n) time and practically as fast. – mike3996 Aug 30 '12 at 15:45
  • 25
    The advantage of a single pass is memory efficiency. Particularly if your array is large enough to be swapped out, this could be huge. – Danica Aug 30 '12 at 15:55
  • 5
    That's not quite true, it's almost half as fast, because with these kinds of arrays, the memory speed is usually the limiting factor, so it can be half as fast... – seberg Aug 30 '12 at 15:57
  • @Dougal -- That's a reasonable point that I hadn't thought of ... memory is certainly a problem for out of order access, but I would expect that iterating over a regular array like this wouldn't be much of a problem since the processor could very reliably predict which elements it needs and pre-fetch as necessary. Is that not the case? – mgilson Aug 30 '12 at 15:58
  • @mgilson I don't know enough about what's going on at a low level here to say either way. It'd be interesting to code up such a function by grafting together the numpy min / max functions and running some tests.... – Danica Aug 30 '12 at 16:09
  • 10
    You don't always need two checks. If `i < minval` is true, then `i > maxval` is always false, so you only need to do 1.5 checks per iteration on average when the second `if` is replaced by an `elif`. – Fred Foo Aug 30 '12 at 16:35
  • @larsmans -- good point. It makes a slight difference too (see my f2py code above). There's definitely more going on here which I don't understand though... – mgilson Aug 30 '12 at 17:20
  • 2
    Small note: I doubt Cython is the way to get the most optimized Python-callable C module. Cython's goal is to be a sort of type-annotated Python, which is then machine-translated to C, whereas `f2py` just wraps hand-coded Fortran so that it is callable by Python. A "fairer" test is probably hand-coding C and then using `f2py` (!) to wrap it for Python. If you're allowing C++, then Shed Skin may be the sweet spot for balancing coding ease with performance. – John Y Aug 30 '12 at 18:19
  • @JohnY -- Have you ever actually used `f2py` to wrap C code? I've seen it mentioned in the docs on occasion, but never with any concrete examples ... I would guess that you could write Cython code to translate almost perfectly to what I wrote in fortran (if you were familiar enough with Cython). Is that not the case? Also, as a matter of personal preference, I never allow C++ ;^), but I suppose OP is free to use any language ... – mgilson Aug 30 '12 at 18:30
  • @mgilson: I have not personally used `f2py` at all. I was going by [this](http://www.scipy.org/Cookbook/f2py_and_NumPy), which you may or may not consider a "concrete example". – John Y Aug 30 '12 at 19:13
  • @mgilson: As for Cython, since it's *machine-generated* C, I am not convinced it would be easier (or any closer to Fortran) than plain old hand-coded C. I mean, for sure I'm no expert, but the Cython-generated C code I've seen is not pretty. It's good for generating a C module *that works*, and that module is usually faster than pure Python, but it's not the C code a C programmer would write. – John Y Aug 30 '12 at 19:16
  • @JohnY -- You might be right about Cython. I worked at learning it for about 2 hours before I said to myself "This is stupid, I know fortran better than I know C and know how to use f2py with it -- why am I bothering with Cython?". Of course, at some point, I'm sure I'll wish I had stuck with it ;). As for the link you posted, that is exactly what I was asking about. Thanks for finding it. – mgilson Aug 30 '12 at 19:22
  • As a point, 1. you did use float32, so memory speed is less of an issue maybe. Since it is, maybe things like Fortran using SSE might come into it. I tried Cython (on a double array); it was maybe 1.3x slower than numpy itself, but at most 1.4x slower when doing minmax at the same time. – seberg Aug 30 '12 at 20:40
  • @JohnY: the generated C code from Cython isn't pretty, but in my experience, `gcc -O2` optimizes it to roughly the performance level of hand-written C code. – Fred Foo Aug 31 '12 at 13:16
  • 1
    @mgilson: I would guess that fetching the array from the main memory into the processor *twice* instead of once explains part of the difference in performance. Comparing the speeds for arrays that are smaller than the processor caches would be a good way of measuring this effect. – Eric O. Lebigot Jan 12 '14 at 12:24
  • 1
    Memory fetching is indeed the answer here. Looping twice for a single-cycle op is not a good idea on today's architectures, GPUs or not. Accessing RAM is ~100 cycles or so. C compilers basically do zilch when it comes to prefetching unless told to; and given the generality that compiled numpy functions aim for (all kinds of array sizes and stridings), compiling in prefetch ops is a no-go. Fortran, as an array-processing language, is much more aggressive on this front. It demands F-contiguity at compile time, and can prefetch accordingly. This probably explains the difference observed here. – Eelco Hoogendoorn Jan 12 '14 at 13:17
  • how much faster this is than doing min followed by max. No accusation, an insinuation maybe... – Andy Hayden Mar 06 '14 at 07:21
  • @AndyHayden -- Surprising what a complier can do with a simple loop ... I guess numpy has a lot more stuff that it needs to deal with to handle strides and all that whatnot correctly. – mgilson Mar 06 '14 at 07:23
  • Yeah, definitely a lot of overhead, but a one pass min-max library function would be great (maybe I'll stick one in pandas algos...) – Andy Hayden Mar 06 '14 at 07:31
  • 5
    as of numpy 1.8 min and max are vectorized on amd64 platforms,on my core2duo numpy performs as well as this fortran code. But a single pass would be advantageous if the array exceed the size of the larger cpu caches. – jtaylor Mar 07 '14 at 13:58
  • 1
    @jtaylor as of numpy v1.10.1 and gcc v4.9.2 on a dual-core processor, the numpy version is actually faster. Numpy takes only `68.1%` of minmax1's time to complete. The numpy version is also finished after `25.5%` of minmax2's time. – mab Dec 01 '15 at 22:54
  • That's definitely interesting. I wonder if passing more aggressive optimization flags to `f2py` would bring them to be more comparable. – mgilson Dec 01 '15 at 22:58
  • turning the else if into an if might be enough, min and max can be done without branching in hardware but as written the compiler will not do that – jtaylor Dec 02 '15 at 23:36
  • WARNING! WEIRD RESULT. As of March '20 with Python 3.7.6 and numpy 1.18.1 on Mac OS X (MacBook mid-'12), the numpy 1.18.1 min/max method is faster UNLESS its results are used for further computation!!! That is, when the code above is tested: 0.554 # numpy min/max; 0.648 # fortran minmax1; 1.891 # fortran minmax2. If `m, M = ...; a = (a-m)/(M-m)` is added to the test: 1.548 # numpy min/max; 1.513 # fortran minmax1; 2.799 # fortran minmax2 – Isaías Mar 16 '20 at 00:52
  • python became smart enough to skip the computation if the result isn't used it seems. – Michael May 12 '21 at 17:41
  • The loop iterator is likely also a check, so min & max on their own are 2 checks per loop. min/max in one loop is 3 checks per loop, (sure, maybe 2.5 if you have an else AND your data is random... but back to 3 checks if your data is sorted the wrong way) – Michael May 12 '21 at 17:50
35

You could use Numba, which is a NumPy-aware dynamic Python compiler using LLVM. The resulting implementation is pretty simple and clear:

import numpy
import numba


@numba.jit
def minmax(x):
    maximum = x[0]
    minimum = x[0]
    for i in x[1:]:
        if i > maximum:
            maximum = i
        elif i < minimum:
            minimum = i
    return (minimum, maximum)


numpy.random.seed(1)
x = numpy.random.rand(1000000)
print(minmax(x) == (x.min(), x.max()))

It should also be faster than NumPy's min() & max() implementations. And all without having to write a single line of C or Fortran code.

Do your own performance tests, as it is always dependent on your architecture, your data, your package versions...
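
For example, a minimal timing sketch (make sure to call the function once first, so that JIT compilation is not included in the measurement):

import timeit

minmax(x)  # warm-up: triggers the JIT compilation
print(timeit.timeit(lambda: minmax(x), number=100))
print(timeit.timeit(lambda: (x.min(), x.max()), number=100))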

Peque
  • 2
    > It should also be faster than a Numpy's min() & max() implementation

    I don't think this is right. numpy isn't native python -- it's C. With `x = numpy.random.rand(10000000)`, timing 1000 runs of `minmax(x)` against 1000 runs of `x.min(); x.max()` results in: ('numba ', 10.299750089645386) ('numpy ', 9.898081064224243) – Authman Apatira
  • 2
    @AuthmanApatira: Yeah, benchmarks are always like that, that is why I said it "*should*" (be faster) and "*do your own performance tests, as it is always dependent on your architecture, your data...*". In my case, I tried with 3 computers and got the same result (Numba was faster than Numpy), but in your computer results may differ... Did you try to execute `numba` function once before the benchmark to make sure it is JIT-compiled?. Also, if you use `ipython`, for simplicity, I would suggest you to use `%timeit whatever_code()` for measuring time execution. – Peque Mar 14 '17 at 08:19
  • 4
    @AuthmanApatira: In any case what I tried to show with this answer is that sometimes Python code (in this case JIT-compiled with Numba) can be as fast as the fastest C-compiled library (at least we are talking about the same order of magnitude), which is impressive taking into account that we wrote nothing else than pure Python code, don't you agree? ^^ – Peque Mar 14 '17 at 08:26
  • I do agree =) Also, thanks for the tips in the previous comment regarding jupyter and compiling the function once outside of the timing code. – Authman Apatira Mar 15 '17 at 13:35
  • 4
    Just ran across this, not that it matters in practical cases, but the `elif` allows for your minimum to be larger than your max. E.g., with an array of length 1, the max will be whatever that value is, while min is +infinity. Not a big deal for a one-off, but not good code to throw deep into the belly of a production beast. – Mike Williamson Apr 18 '17 at 16:43
  • 1
    @MikeWilliamson You are completely right, thanks for pointing that out! I updated my answer with better `minimum`/`maximum` initial values. ^^ – Peque Apr 19 '17 at 12:14
32

There is a function for finding (max - min), called numpy.ptp, if that's useful for you:

>>> import numpy
>>> x = numpy.array([1,2,3,4,5,6])
>>> x.ptp()
5

but I don't think there's a way to find both min and max with one traversal.

EDIT: ptp just calls min and max under the hood
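
In other words, it is effectively equivalent to:

>>> x.max() - x.min()
5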

mab
jterrace
  • 3
    It's annoying because presumably the way ptp is implemented it has to keep track of max and min! – Andy Hayden Aug 30 '12 at 16:50
  • 1
    Or it might just call max and min, not sure – jterrace Aug 30 '12 at 17:06
  • 3
    @hayden turns out ptp just calls max and min – jterrace Aug 30 '12 at 20:26
  • 2
    That was the masked-array code; the main ndarray code is in C. But it turns out the C code also iterates over the array twice: https://github.com/numpy/numpy/blob/29e6071d2376f9e96cd111adc7b274e07ac88e8d/numpy/core/src/multiarray/calculation.c#L318. – Ken Arnold Jun 14 '17 at 19:29
27

Just to get some idea of the numbers one could expect, consider the following approaches:

import numpy as np


def extrema_np(arr):
    return np.max(arr), np.min(arr)

import numba as nb


@nb.jit(nopython=True)
def extrema_loop_nb(arr):
    n = arr.size
    max_val = min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    return max_val, min_val

import numba as nb


@nb.jit(nopython=True)
def extrema_while_nb(arr):
    n = arr.size
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    return max_val, min_val

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


cdef void _extrema_loop_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i
    cdef long item, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    result[0] = max_val
    result[1] = min_val


def extrema_loop_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_loop_cy(arr, arr.size, result)
    return result[0], result[1]

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True


import numpy as np


cdef void _extrema_while_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i, odd
    cdef long x, y, max_val, min_val
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    result[0] = max_val
    result[1] = min_val


def extrema_while_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_while_cy(arr, arr.size, result)
    return result[0], result[1]

(the extrema_loop_*() approaches are similar to what is proposed here, while the extrema_while_*() approaches are based on the code from here)

The following timings:

[benchmark timings plot]

indicate that the extrema_while_*() approaches are the fastest, with extrema_while_nb() being the fastest of all. In any case, the extrema_loop_nb() and extrema_loop_cy() solutions also outperform the NumPy-only approach (using np.max() and np.min() separately).

Finally, note that none of these is as flexible as np.min()/np.max() (in terms of n-dim support, axis parameter, etc.).

(full code is available here)
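
A minimal harness along those lines (sizes and dtype chosen arbitrarily; note that the Cython variants as written expect a native long/int64 array):

import timeit

arr = np.random.randint(-1000, 1000, 1_000_000)
for func in (extrema_np, extrema_loop_nb, extrema_while_nb,
             extrema_loop_cy, extrema_while_cy):
    t = timeit.timeit(lambda: func(arr), number=10)
    print(func.__name__, t)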

norok2
16

Nobody mentioned numpy.percentile, so I thought I would. If you ask for [0, 100] percentiles, it will give you an array of two elements, the min (0th percentile) and the max (100th percentile).

However, it doesn't satisfy the OP's purpose: it's not faster than min and max separately. That's probably due to some machinery that would allow for non-extreme percentiles (a harder problem, which should take longer).

In [1]: import numpy

In [2]: a = numpy.random.normal(0, 1, 1000000)

In [3]: %%timeit
   ...: lo, hi = numpy.amin(a), numpy.amax(a)
   ...: 
100 loops, best of 3: 4.08 ms per loop

In [4]: %%timeit
   ...: lo, hi = numpy.percentile(a, [0, 100])
   ...: 
100 loops, best of 3: 17.2 ms per loop

In [5]: numpy.__version__
Out[5]: '1.14.4'

A future version of NumPy could put in a special case to skip the normal percentile calculation if only [0, 100] are requested. So, without adding anything to the interface, there is a way to ask NumPy for min and max in one call (contrary to what was said in the accepted answer), but the standard implementation of the library doesn't take advantage of this case to make it worthwhile.
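
A user-level sketch of such a special case (a hypothetical wrapper, not part of NumPy):

import numpy

def fast_percentile(a, q):
    q = numpy.asarray(q)
    if q.size == 2 and q[0] == 0 and q[1] == 100:
        # skip the general percentile machinery for a pure min/max request
        return numpy.array([a.min(), a.max()], dtype=float)
    return numpy.percentile(a, q)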

Jim Pivarski
13

In general you can reduce the number of comparisons for a minmax algorithm by processing two elements at a time and only comparing the smaller one to the running minimum and the bigger one to the running maximum. On average, one needs only 3/4 as many comparisons as with the naive approach.

This could be implemented in C or Fortran (or any other low-level language) and should be almost unbeatable in terms of performance. I'm using numba here to illustrate the principle and to get a very fast, dtype-independent implementation:

import numba as nb
import numpy as np

@nb.njit
def minmax(array):
    # Ravel the array and return early if it's empty
    array = array.ravel()
    length = array.size
    if not length:
        return

    # We want to process two elements at once so we need
    # an even sized array, but we preprocess the first and
    # start with the second element, so we want it "odd"
    odd = length % 2
    if not odd:
        length -= 1

    # Initialize min and max with the first item
    minimum = maximum = array[0]

    i = 1
    while i < length:
        # Get the next two items and swap them if necessary
        x = array[i]
        y = array[i+1]
        if x > y:
            x, y = y, x
        # Compare the min with the smaller one and the max
        # with the bigger one
        minimum = min(x, minimum)
        maximum = max(y, maximum)
        i += 2

    # If we had an even sized array we need to compare the
    # one remaining item too.
    if not odd:
        x = array[length]
        minimum = min(x, minimum)
        maximum = max(x, maximum)

    return minimum, maximum

It's definitely faster than the naive approach that Peque presented:

arr = np.random.random(3000000)
assert minmax(arr) == minmax_peque(arr)  # warmup and making sure they are identical 
%timeit minmax(arr)            # 100 loops, best of 3: 2.1 ms per loop
%timeit minmax_peque(arr)      # 100 loops, best of 3: 2.75 ms per loop

As expected, the new minmax implementation only takes roughly 3/4 of the time the naive implementation took (2.1 / 2.75 ≈ 0.76).

MSeifert
  • 1
    On my machine, your solution is not faster than Peque's. Numba 0.33. – John Zwinck May 24 '17 at 04:55
  • @johnzwinck did you run the benchmark in my answer or a different one? If so, could you share it? But it's possible: I noticed some regressions in newer versions too. – MSeifert May 24 '17 at 07:36
  • I ran your benchmark. The timings of your solution and @Peque's were pretty much the same (~2.8 ms). – John Zwinck May 24 '17 at 10:09
  • @JohnZwinck That's weird, I just tested it again and on my computer it's definitely faster. Maybe that has something to do with numba and LLVM that depends on the hardware. – MSeifert May 29 '17 at 15:56
  • I tried on another machine now (a beefy workstation) and got 2.4 ms for yours vs 2.6 for Peque's. So, a small win. – John Zwinck May 30 '17 at 01:36
  • ok, I got 8ms (mine) vs. 11ms (peque) on my older machine. That's definitely weird. :( – MSeifert May 30 '17 at 01:37
9

This is an old thread, but anyway, if anyone ever looks at this again...

When looking for the min and max simultaneously, it is possible to reduce the number of comparisons. If it is floats you are comparing (which I guess it is), this might save you some time, although it won't reduce the computational complexity.

Instead of (Python code):

_max = ar[0]
_min = ar[0]
for ii in range(len(ar)):
    if ar[ii] > _max: _max = ar[ii]
    if ar[ii] < _min: _min = ar[ii]

you can first compare two adjacent values in the array, and then only compare the smaller one against current minimum, and the larger one against current maximum:

## for an even-sized array
_max = ar[0]
_min = ar[0]
for ii in range(0, len(ar), 2):  ## iterate over every other value in the array
    f1 = ar[ii]
    f2 = ar[ii+1]
    if (f1 < f2):
        if f1 < _min: _min = f1
        if f2 > _max: _max = f2
    else:
        if f2 < _min: _min = f2
        if f1 > _max: _max = f1

The code here is written in Python; clearly, for speed you would use C, Fortran, or Cython, but this way you do 3 comparisons per iteration, with len(ar)/2 iterations, giving 3/2 * len(ar) comparisons in total. As opposed to that, doing the comparison "the obvious way", you do 2 comparisons per iteration, leading to 2*len(ar) comparisons. Saves you 25% of comparison time.

Maybe someone one day will find this useful.

Bennet
  • 6
    have you benchmarked this? on modern x86 hardware you have machine instructions for min and max as used in the first variant, these avoid the need for branches while your code puts in a control dependency which probably does not map as well to the hardware. – jtaylor Mar 07 '14 at 14:28
  • I haven't actually. Will do if I get a chance. I think it's pretty clear that pure python code will lose hands-down to any sensible compiled implementation, but I wonder if a speedup could be seen in Cython... – Bennet Mar 12 '14 at 19:35
  • 15
    There is a minmax implementation in numpy, under the hood, used by `np.bincount`, see [here](https://github.com/numpy/numpy/blob/maintenance/1.9.x/numpy/lib/src/_compiled_base.c#L110). It does not use the trick you point out, because it turned out to be up to 2x slower than the naive approach. There is a link from the [PR](https://github.com/numpy/numpy/pull/4282) to some comprehensive benchmarks of both methods. – Jaime Feb 05 '15 at 13:19
6

At first glance, numpy.histogram appears to do the trick:

count, (amin, amax) = numpy.histogram(a, bins=1)

... but if you look at the source for that function, it simply calls a.min() and a.max() independently, and therefore fails to avoid the performance concerns addressed in this question. :-(

Similarly, scipy.ndimage.measurements.extrema looks like a possibility, but it, too, simply calls a.min() and a.max() independently.
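
For reference, a sketch of that scipy call (it also returns the argmin/argmax positions; the exact import path varies across SciPy versions):

import numpy as np
from scipy import ndimage

a = np.random.rand(100)
amin, amax, pos_min, pos_max = ndimage.extrema(a)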

Stuart Berg
  • 3
    `np.histogram` doesn't always work for this because the returned `(amin, amax)` values are for the minimum and maximum values of the bin. If I have, for example, `a = np.zeros(10)`, `np.histogram(a, bins=1)` returns `(array([10]), array([-0.5, 0.5]))`. The user is looking for `(amin, amax)` = (0, 0) in that case. – eclark Mar 31 '17 at 21:58
3

It was worth the effort for me anyway, so I'll propose the most difficult and least elegant solution here for whoever may be interested. My solution is to implement a multi-threaded, one-pass min-max algorithm in C++, and use it to create a Python extension module. This effort requires a bit of overhead for learning how to use the Python and NumPy C/C++ APIs, and here I will show the code and give some small explanations and references for whoever wishes to go down this path.

Multi-threaded Min/Max

There is nothing too interesting here. The array is broken into chunks of size length / workers. The min/max of each chunk is calculated in a future; the per-chunk results are then scanned for the global min/max.

// mt_np.cc
//
// multi-threaded min/max algorithm

#include <algorithm>
#include <future>
#include <vector>

namespace mt_np {

/*
 * Get {min,max} in interval [begin,end)
 */
template <typename T> std::pair<T, T> min_max(T *begin, T *end) {
  T min{*begin};
  T max{*begin};
  while (++begin < end) {
    if (*begin < min) {
      min = *begin;
      continue;
    } else if (*begin > max) {
      max = *begin;
    }
  }
  return {min, max};
}

/*
 * get {min,max} in interval [begin,end) using #workers for concurrency
 */
template <typename T>
std::pair<T, T> min_max_mt(T *begin, T *end, int workers) {
  const long int chunk_size = std::max((end - begin) / workers, 1l);
  std::vector<std::future<std::pair<T, T>>> min_maxes;
  // fire up the workers
  while (begin < end) {
    T *next = std::min(end, begin + chunk_size);
    min_maxes.push_back(std::async(min_max<T>, begin, next));
    begin = next;
  }
  // retrieve the results
  auto min_max_it = min_maxes.begin();
  auto v{min_max_it->get()};
  T min{v.first};
  T max{v.second};
  while (++min_max_it != min_maxes.end()) {
    v = min_max_it->get();
    min = std::min(min, v.first);
    max = std::max(max, v.second);
  }
  return {min, max};
}
}; // namespace mt_np

The Python Extension Module

Here is where things start getting ugly... One way to use C++ code in Python is to implement an extension module. This module can be built and installed using the distutils.core standard module. A complete description of what this entails is covered in the Python documentation: https://docs.python.org/3/extending/extending.html. NOTE: there are certainly other ways to get similar results, to quote https://docs.python.org/3/extending/index.html#extending-index:

This guide only covers the basic tools for creating extensions provided as part of this version of CPython. Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python.

Essentially, this route is probably more academic than practical. With that being said, what I did next was to create a module file, sticking pretty close to the tutorial. This is essentially boilerplate for distutils to know what to do with your code and create a Python module out of it. Before doing any of this, it is probably wise to create a Python virtual environment so you don't pollute your system packages (see https://docs.python.org/3/library/venv.html#module-venv).

Here is the module file:

// mt_np_forpy.cc
//
// C++ module implementation for multi-threaded min/max for np

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION

#include <python3.6/numpy/arrayobject.h>

#include "mt_np.h"

#include <cstdint>
#include <iostream>

using namespace std;

/*
 * check:
 *  shape
 *  stride
 *  data_type
 *  byteorder
 *  alignment
 */
static bool check_array(PyArrayObject *arr) {
  if (PyArray_NDIM(arr) != 1) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong shape, require (1,n)");
    return false;
  }
  if (PyArray_STRIDES(arr)[0] != 8) {
    PyErr_SetString(PyExc_RuntimeError, "Expected stride of 8");
    return false;
  }
  PyArray_Descr *descr = PyArray_DESCR(arr);
  if (descr->type != NPY_LONGLTR && descr->type != NPY_DOUBLELTR) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong type, require l or d");
    return false;
  }
  if (descr->byteorder != '=') {
    PyErr_SetString(PyExc_RuntimeError, "Expected native byteorder");
    return false;
  }
  if (descr->alignment != 8) {
    cerr << "alignment: " << descr->alignment << endl;
    PyErr_SetString(PyExc_RuntimeError, "Require proper alignement");
    return false;
  }
  return true;
}

template <typename T>
static PyObject *mt_np_minmax_dispatch(PyArrayObject *arr) {
  npy_intp size = PyArray_SHAPE(arr)[0];
  T *begin = (T *)PyArray_DATA(arr);
  auto minmax =
      mt_np::min_max_mt(begin, begin + size, thread::hardware_concurrency());
  return Py_BuildValue("(L,L)", minmax.first, minmax.second);
}

static PyObject *mt_np_minmax(PyObject *self, PyObject *args) {
  PyArrayObject *arr;
  if (!PyArg_ParseTuple(args, "O", &arr))
    return NULL;
  if (!check_array(arr))
    return NULL;
  switch (PyArray_DESCR(arr)->type) {
  case NPY_LONGLTR: {
    return mt_np_minmax_dispatch<int64_t>(arr);
  } break;
  case NPY_DOUBLELTR: {
    return mt_np_minmax_dispatch<double>(arr);
  } break;
  default: {
    PyErr_SetString(PyExc_RuntimeError, "Unknown error");
    return NULL;
  }
  }
}

static PyObject *get_concurrency(PyObject *self, PyObject *args) {
  return Py_BuildValue("I", thread::hardware_concurrency());
}

static PyMethodDef mt_np_Methods[] = {
    {"mt_np_minmax", mt_np_minmax, METH_VARARGS, "multi-threaded np min/max"},
    {"get_concurrency", get_concurrency, METH_VARARGS,
     "retrieve thread::hardware_concurrency()"},
    {NULL, NULL, 0, NULL} /* sentinel */
};

static struct PyModuleDef mt_np_module = {PyModuleDef_HEAD_INIT, "mt_np", NULL,
                                          -1, mt_np_Methods};

PyMODINIT_FUNC PyInit_mt_np() { return PyModule_Create(&mt_np_module); }

In this file there is a significant use of the Python as well as the NumPy API, for more information consult: https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple, and for NumPy: https://docs.scipy.org/doc/numpy/reference/c-api.array.html.

Installing the Module

The next thing to do is to utilize distutils to install the module. This requires a setup file:

# setup.py

from distutils.core import setup,Extension

module = Extension('mt_np', sources = ['mt_np_forpy.cc'])

setup (name = 'mt_np', 
       version = '1.0', 
       description = 'multi-threaded min/max for np arrays',
       ext_modules = [module])

To finally install the module, execute python3 setup.py install from your virtual environment.

Testing the Module

Finally, we can test to see if the C++ implementation actually outperforms naive use of NumPy. To do so, here is a simple test script:

# timing.py
# compare numpy min/max vs multi-threaded min/max

import numpy as np
import mt_np
import timeit

def normal_min_max(X):
  return (np.min(X),np.max(X))

print(mt_np.get_concurrency())

for ssize in np.logspace(3,8,6):
  size = int(ssize)
  print('********************')
  print('sample size:', size)
  print('********************')
  samples = np.random.normal(0,50,(2,size))
  for sample in samples:
    print('np:', timeit.timeit('normal_min_max(sample)',
                 globals=globals(),number=10))
    print('mt:', timeit.timeit('mt_np.mt_np_minmax(sample)',
                 globals=globals(),number=10))

Here are the results I got from doing all this:

8  
********************  
sample size: 1000  
********************  
np: 0.00012079699808964506  
mt: 0.002468645994667895  
np: 0.00011947099847020581  
mt: 0.0020772050047526136  
********************  
sample size: 10000  
********************  
np: 0.00024697799381101504  
mt: 0.002037393998762127  
np: 0.0002713389985729009  
mt: 0.0020942929986631498  
********************  
sample size: 100000  
********************  
np: 0.0007130410012905486  
mt: 0.0019842900001094677  
np: 0.0007540129954577424  
mt: 0.0029724110063398257  
********************  
sample size: 1000000  
********************  
np: 0.0094779249993735  
mt: 0.007134920000680722  
np: 0.009129883001151029  
mt: 0.012836456997320056  
********************  
sample size: 10000000  
********************  
np: 0.09471094200125663  
mt: 0.0453535050037317  
np: 0.09436299200024223  
mt: 0.04188535599678289  
********************  
sample size: 100000000  
********************  
np: 0.9537652180006262  
mt: 0.3957935369980987  
np: 0.9624398809974082  
mt: 0.4019058070043684  

These results are far less encouraging than the ones reported earlier in the thread, which indicated somewhere around a 3.5x speedup without incorporating multi-threading. The results I achieved are somewhat reasonable: I would expect the overhead of threading to dominate the time until the arrays got very large, at which point the performance increase would start to approach a factor of std::thread::hardware_concurrency.

Conclusion

There is certainly room for application-specific optimizations to some NumPy code, it would seem, in particular with regard to multi-threading. Whether or not it is worth the effort is not clear to me, but it certainly seems like a good exercise (or something). I think that perhaps learning some of those "third party tools" like Cython may be a better use of time, but who knows.

Nathan Chappell
  • 1
    I started studying your code; I know some C++, but I still haven't used std::future and std::async. In your 'min_max_mt' template function, how does it know that every worker has finished between firing them up and retrieving the results? (Asking just to understand, not saying that there is anything wrong with it) – ChrCury78 Jun 23 '20 at 00:06
  • The line `v = min_max_it->get();`. The `get` method blocks until the result is ready and returns it. Since the loop goes through each future, it won't finish until they are all done. [future.get()](https://en.cppreference.com/w/cpp/thread/future/get) – Nathan Chappell Jun 23 '20 at 07:21
2

Inspired by the previous answer, I've written a numba implementation returning minmax for axis=0 of a 2-D array. It's ~5x faster than calling numpy min/max. Maybe someone will find it useful.

import numpy as np
from numba import jit

@jit
def minmax(x):
    """Return minimum and maximum from 2D array for axis=0."""    
    m, n = len(x), len(x[0])
    mi, ma = np.empty(n), np.empty(n)
    mi[:] = ma[:] = x[0]
    for i in range(1, m):
        for j in range(n):
            if x[i, j]>ma[j]: ma[j] = x[i, j]
            elif x[i, j]<mi[j]: mi[j] = x[i, j]
    return mi, ma

x = np.random.normal(size=(256, 11))
mi, ma = minmax(x)

np.all(mi == x.min(axis=0)), np.all(ma == x.max(axis=0))
# (True, True)


%timeit x.min(axis=0), x.max(axis=0) 
# 15.9 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit minmax(x) 
# 2.62 µs ± 31.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Leszek
  • It could be a bit faster if you avoid `mi[:] = ma[:] = x[0]` and integrate that part of your code in else statements of your loop. `ma[:] = x[0]` is nothing more than a full loop over the array ma, which can be avoided. – max9111 Aug 28 '20 at 16:40
0

The shortest way I've come up with is this:

mn, mx = np.sort(ar)[[0, -1]]

But since it sorts the array, it's not the most efficient.

Another short way would be:

mn, mx = np.percentile(ar, [0, 100])

This should be more efficient, but the result is calculated rather than merely selected, and a float is returned.
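
For example (note the float result even for integer input):

>>> np.percentile(np.array([1, 2, 3]), [0, 100])
array([1., 3.])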

Israel Unterman
  • 2
    Shamefully, those two are the slowest solutions compare to others in this page: m = np.min(a); M = np.max(a) --> 0.54002 ||| m, M = f90_minmax1(a) --> 0.72134 ||| m, M = numba_minmax(a) --> 0.77323 ||| m, M = np.sort(a)[[0, -1]] --> 12.01456 ||| m, M = np.percentile(a, [0, 100]) --> 11.09418 ||| _in seconds for 10000 repetitions for an array of 100k elements_ – Isaías Mar 16 '20 at 01:54
0

Maybe use numpy.unique? Like so:

min_, max_ = numpy.unique(arr)[[0, -1]]

Just added it here for variety :) It's just as slow as sort.

Slobodan Ilic