
Given a list of 2-element sublists, I am trying to find the quickest way to determine the min/max value at each index. The goal is to determine the bounds/extent of a series of XY points.

The sublists are unsorted (sorting on one index doesn't guarantee the other is sorted).

Currently I am doing the following:

xy = [(x1, y1), (x2, y2), ..., (xn, yn)]

xs, ys = zip(*xy)
xmax = max(xs)
xmin = min(xs)
ymax = max(ys)
ymin = min(ys)

If I'm not mistaken, each operation is O(n), so the overall complexity is O(n).

Is there a faster way for a list of arbitrary size?

pstatix
  • You could do this in a single pass by keeping track of the current mins and maxes seen so far. – Mark Jan 03 '20 at 15:57
  • Would `numpy` / `numba` be an option? What about `cython`? – norok2 Jan 03 '20 at 15:57
  • @MarkMeyer I find it hard to believe that explicit looping in pure Python will be faster. – norok2 Jan 03 '20 at 15:58
  • @MarkMeyer but I was [blatantly wrong](https://stackoverflow.com/a/59581931/5218354) :-) – norok2 Jan 03 '20 at 16:23
  • That's interesting @norok2, thanks for doing the analysis. I can't help thinking there's *some* way to get numpy to do it in one pass, but can't think of anything. – Mark Jan 03 '20 at 16:37
  • @MarkMeyer Can I apply the loop to a numpy array? Is iteration over that faster? Or is the benefit just that the array will be more memory efficient? – pstatix Jan 03 '20 at 16:49
  • @MarkMeyer A good topic along your thought [here](https://stackoverflow.com/questions/12200580/numpy-function-for-simultaneous-max-and-min). – pstatix Jan 03 '20 at 17:02

3 Answers


Here are a couple of alternate methods:

def extrema_zip(items):
    # Transpose with zip() and scan each coordinate sequence twice.
    split0, split1 = zip(*items)
    return max(split0), min(split0), max(split1), min(split1)


def extrema_key(items):
    # Four passes over the original pairs, selecting by coordinate.
    return (
        max(items, key=lambda x: x[0])[0],
        min(items, key=lambda x: x[0])[0],
        max(items, key=lambda x: x[1])[1],
        min(items, key=lambda x: x[1])[1])


import numpy as np


def extrema_np(items):
    # Convert to a NumPy array and reduce along each column.
    arr = np.array(items)
    return np.max(arr[:, 0]), np.min(arr[:, 0]), np.max(arr[:, 1]), np.min(arr[:, 1])


import numpy as np


def extrema_npt(items):
    # Same as extrema_np(), but on the transposed array.
    arr = np.array(items).transpose()
    return np.max(arr[0, :]), np.min(arr[0, :]), np.max(arr[1, :]), np.min(arr[1, :])


def extrema_loop(items):
    # Single pass in pure Python, tracking the running extrema.
    iter_items = iter(items)
    first = next(iter_items)
    x_min = x_max = first[0]
    y_min = y_max = first[1]
    for x, y in iter_items:
        if x > x_max:
            x_max = x
        elif x < x_min:
            x_min = x
        if y > y_max:
            y_max = y
        elif y < y_min:
            y_min = y
    return x_max, x_min, y_max, y_min


import numpy as np
import numba as nb


@nb.jit(nopython=True)
def _extrema_loop_nb(arr):
    # The same single-pass loop, JIT-compiled by Numba on a 2D array.
    n, m = arr.shape
    x_min = x_max = arr[0, 0]
    y_min = y_max = arr[0, 1]
    for i in range(1, n):
        x, y = arr[i, :]
        if x > x_max:
            x_max = x
        elif x < x_min:
            x_min = x
        if y > y_max:
            y_max = y
        elif y < y_min:
            y_min = y
    return x_max, x_min, y_max, y_min


def extrema_loop_nb(items):
    arr = np.array(items)
    return _extrema_loop_nb(arr)

and their respective timings as a function of input size:

[benchmark plots: bm_full (full input-size range) and bm_zoom (zoomed-in view) — timings vs. input size]

which show that direct looping is actually beneficial for your use-case.
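
If you want to sanity-check the ranking on your own machine, a minimal `timeit` sketch along these lines should do (the random test input and the Numba warm-up call are my additions, not part of the original benchmarks):

import random
import timeit

# Hypothetical test input: n random (x, y) pairs as a list of tuples.
n = 100_000
items = [(random.randrange(-1000, 1000), random.randrange(-1000, 1000))
         for _ in range(n)]

# Call the Numba version once up front so JIT compilation is not timed.
extrema_loop_nb(items)

for func in (extrema_zip, extrema_key, extrema_np,
             extrema_npt, extrema_loop, extrema_loop_nb):
    elapsed = timeit.timeit(lambda: func(items), number=10)
    print(f'{func.__name__:<16} {elapsed / 10:.6f} s per call')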

(full benchmarks available here)


See here for similar approaches working on NumPy array inputs.

norok2
  • I had to run the tests on my own as my system restricts images from imgur; I found that the loop was fastest, then zip, then np. However, I think that's because the conversion to an `np.ndarray` followed by a transpose is expensive. Thus, if `xy` is already an `np.ndarray`, you can reduce the work to just the `max` and `min` calls. – pstatix Jan 03 '20 at 17:46
  • @norok2 what did you use to make the graphs? – bad_coder Jan 03 '20 at 18:01
  • @pstatix I have updated the benchmarks to include the non-transposed one. The transposed one looks a tad faster but the difference is too small to be significant. I have also included the [colab notebook](https://colab.research.google.com/drive/1meqz2ds35AzvdU2lu5r56WPRxk1gp_t2), I hope you can meddle with that. – norok2 Jan 03 '20 at 18:38
  • @bad_coder I use my own pip-available library. You can check it out in this [colab notebook](https://colab.research.google.com/drive/1meqz2ds35AzvdU2lu5r56WPRxk1gp_t2) which I have also linked in the answer. – norok2 Jan 03 '20 at 18:40
  • Looks like I will stick with just the native Python loop. I ran a benchmark with an altered version of it in which `items = np.array(items)` to see if the array access times were faster but they were actually slower. The only benefit I can see here using an `np.ndarray` is the memory savings. – pstatix Jan 03 '20 at 19:54
  • @pstatix If you can start from a NumPy array, you can get to (much?) faster timings with Cython or Numba. – norok2 Jan 03 '20 at 20:13
  • @norok2 Well, as mentioned, using `extrema_loop` but passing in an `np.ndarray` is actually slower than passing in a list of lists for the iteration. Not entirely sure why, but nevertheless it's an order of magnitude slower. – pstatix Jan 03 '20 at 20:20
  • @norok2 If you see my edit at the bottom of your post, you'll see a method that I benchmarked to run >50% faster than `extrema_loop`. Here, `items` is an `np.ndarray` and is not converted inside the method as with `extrema_np`. I cannot use your benchmark tool as I have to log in to do so and my system doesn't permit that. If you want to benchmark it with your tool I'd be interested to see the results and whether my testing was correct. – pstatix Jan 03 '20 at 20:45
  • @pstatix See [here](https://stackoverflow.com/a/59585600/5218354) for Numpy array inputs (including `np.max(axis=...)`/`np.min(axis=...)` solutions). – norok2 Jan 03 '20 at 21:35
  • Iterating over a numpy array is always slower than iterating over a list; this is well-known behavior. It's caused by numpy having to box its primitive values and create an object on each access, whereas a Python list is simply moving pointers around under the hood (see the sketch after these comments). – juanpa.arrivillaga Jan 03 '20 at 21:39
  • @juanpa.arrivillaga Good to know. I haven't really used NumPy a lot (mainly because I haven't needed to clog the namespaces yet when a list works). – pstatix Jan 04 '20 at 03:41
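
A quick way to see the boxing cost juanpa.arrivillaga describes (my own micro-benchmark sketch, not from the original discussion):

import timeit
import numpy as np

data_list = list(range(100_000))
data_arr = np.arange(100_000)

# The list iteration hands out existing object pointers; the array
# iteration has to create a new Python scalar object per element.
print(timeit.timeit('for x in data: pass',
                    globals={'data': data_list}, number=100))
print(timeit.timeit('for x in data: pass',
                    globals={'data': data_arr}, number=100))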

If you are willing to start from a NumPy array, you could adapt the solutions from here, omitting the senseless ones (like those using `zip()` or the `key` parameter of `min()`/`max()`), and add a couple more:

def extrema_py(arr):
    # Built-in max()/min() on NumPy column slices.
    return max(arr[:, 0]), min(arr[:, 0]), max(arr[:, 1]), min(arr[:, 1])


import numpy as np


def extrema_np(arr):
    # NumPy reductions on each column.
    return np.max(arr[:, 0]), np.min(arr[:, 0]), np.max(arr[:, 1]), np.min(arr[:, 1])


import numpy as np


def extrema_npt(arr):
    # Same as extrema_np(), but on the transposed array.
    arr = arr.transpose()
    return np.max(arr[0, :]), np.min(arr[0, :]), np.max(arr[1, :]), np.min(arr[1, :])


import numpy as np


def extrema_npa(arr):
    # One np.max()/np.min() call per direction, using the axis parameter.
    x_max, y_max = np.max(arr, axis=0)
    x_min, y_min = np.min(arr, axis=0)
    return x_max, x_min, y_max, y_min


import numpy as np


def extrema_npat(arr):
    # Axis-based reductions on the transposed array.
    arr = arr.transpose()
    x_max, y_max = np.max(arr, axis=1)
    x_min, y_min = np.min(arr, axis=1)
    return x_max, x_min, y_max, y_min


def extrema_loop(arr):
    # Single pass in pure Python over the array rows.
    n, m = arr.shape
    x_min = x_max = arr[0, 0]
    y_min = y_max = arr[0, 1]
    for i in range(1, n):
        x, y = arr[i, :]
        if x > x_max:
            x_max = x
        elif x < x_min:
            x_min = x
        if y > y_max:
            y_max = y
        elif y < y_min:
            y_min = y
    return x_max, x_min, y_max, y_min


import numba as nb


@nb.jit(nopython=True)
def extrema_loop_nb(arr):
    # The same single-pass loop, JIT-compiled by Numba; scalar indexing
    # instead of row slicing helps the compiler.
    n, m = arr.shape
    x_min = x_max = arr[0, 0]
    y_min = y_max = arr[0, 1]
    for i in range(1, n):
        x = arr[i, 0]
        y = arr[i, 1]
        if x > x_max:
            x_max = x
        elif x < x_min:
            x_min = x
        if y > y_max:
            y_max = y
        elif y < y_min:
            y_min = y
    return x_max, x_min, y_max, y_min


%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np
import cython as cy


cdef void _extrema_loop_cy(
        long[:, :] arr,
        size_t n,
        size_t m,
        long[:, :] result):
    cdef size_t i, j
    cdef long x, y, x_max, x_min, y_max, y_min
    x_min = x_max = arr[0, 0]
    y_min = y_max = arr[0, 1]
    for i in range(1, n):
        x = arr[i, 0]
        y = arr[i, 1]
        if x > x_max:
            x_max = x
        elif x < x_min:
            x_min = x
        if y > y_max:
            y_max = y
        elif y < y_min:
            y_min = y
    result[0, 0] = x_max
    result[0, 1] = x_min
    result[1, 0] = y_max
    result[1, 1] = y_min


def extrema_loop_cy(arr):
    n, m = arr.shape
    result = np.zeros((2, m), dtype=arr.dtype)
    _extrema_loop_cy(arr, n, m, result)
    return result[0, 0], result[0, 1], result[1, 0], result[1, 1]

and their respective timings as a function of input size:

[benchmark plots: bm_full (full input-size range) and bm_zoom (zoomed-in view) — timings vs. input size]

So, for NumPy array inputs, one can get much faster timings. The Numba- and Cython-based solutions seem to be the fastest, remarkably outperforming the fastest NumPy-only approaches.

(full benchmarks available here)
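
As a quick consistency check (my own sketch with a hypothetical random input; `extrema_loop_cy` is left out since it requires the Cython cell to be compiled first), all variants should agree:

import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(-1000, 1000, (10_000, 2))

# Every implementation should return the same (x_max, x_min, y_max, y_min).
reference = extrema_npa(arr)
for func in (extrema_py, extrema_np, extrema_npt, extrema_npat,
             extrema_loop, extrema_loop_nb):
    assert tuple(func(arr)) == tuple(reference), func.__name__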

(EDITED to improve Numba-based solution)

norok2

For a list of any size, if you're trying to find the min and max, you have to look at every element, which is O(n), and there's absolutely nothing you can do about that lower bound.

Having said that, you may split the list into chunks and feed them to different processes, which can make it faster depending on how many processor cores you have.
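
For illustration, a minimal sketch of that idea with `multiprocessing` (the chunking scheme and worker count are arbitrary choices of mine; in practice the process start-up and pickling overhead can easily outweigh the gain for such cheap per-element work):

import multiprocessing as mp

def chunk_extrema(chunk):
    # Extrema of one chunk, using the zip-transpose approach.
    xs, ys = zip(*chunk)
    return max(xs), min(xs), max(ys), min(ys)

def parallel_extrema(items, workers=4):
    # Split into roughly equal chunks, one per worker.
    size = (len(items) + workers - 1) // workers
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with mp.Pool(workers) as pool:
        partial = pool.map(chunk_extrema, chunks)
    # Combine the per-chunk results into global extrema.
    return (max(p[0] for p in partial), min(p[1] for p in partial),
            max(p[2] for p in partial), min(p[3] for p in partial))

if __name__ == '__main__':
    xy = [(3, 5), (-1, 9), (7, 2), (0, -4)]
    print(parallel_extrema(xy, workers=2))  # (7, -1, 9, -4)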

lenik