I have an array like the following one:

import numpy as np

data = [
  [-20],
  [-23],
  [-41],
  [1, 2, 3],
  [2, 3],
  [5, 6, 7, 8, 9],
]
arr = np.array(data)

How can I use numpy to find the minimum/maximum value of each array in data? Neither np.min nor np.max seems to work, even if I specify a different axis. The desired result would look like the following:

>>> np.findmin(arr)
array([-20, -23, -41, 1, 2, 5])
>>> np.findmax(arr)
array([-20, -23, -41, 3, 3, 9])

Also, I'm not entirely clear on why np.min and np.max aren't working. Perhaps they would only work the way I want if the given array had well-defined axes where each row had a fixed number of columns? If anyone can explain this, I would be interested to know.
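
For reference, here is what np.min actually gives me (and specifying axis=1 just raises an error):

>>> np.min(arr)
[-41]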

David Sanders
  • That post doesn't answer the question that I'm asking. The array they're using contains arrays with a constant shape. I'm asking about arrays with variable shapes. – David Sanders Jun 30 '14 at 16:15
  • that's why it's a comment, and not a dupe or an answer. It's just related reading that may be helpful – wnnmaw Jun 30 '14 at 16:16
  • http://stackoverflow.com/questions/5469286/how-to-get-the-index-of-a-maximum-element-in-a-numpy-array provides, in the question, an answer to your question, by using `amax`. Point taken on the variable shape arrays. – fiveclubs Jun 30 '14 at 16:17
  • wnnmaw: Am I not free to point out why the post doesn't answer my question? – David Sanders Jun 30 '14 at 16:21
  • @DavidSanders, you are, but I don't believe it was trying to answer your question – wnnmaw Jun 30 '14 at 16:24

2 Answers

It's possible, but this isn't the sort of thing numpy is good at. One possible solution is to pad each row with nan and use np.nanmin/np.nanmax, like so:

import numpy as np

def pad_array(data):
    # pad every row with nan out to the length of the longest row,
    # giving a regular 2-D float array that numpy can reduce over
    M = max(len(a) for a in data)
    return np.array([a + [np.nan] * (M - len(a)) for a in data])

data = [
  [-20],
  [-23],
  [-41],
  [1, 2, 3],
  [2, 3],
  [5, 6, 7, 8, 9],
]
arr = pad_array(data)
# array([[-20.,  nan,  nan,  nan,  nan],
#        [-23.,  nan,  nan,  nan,  nan],
#        [-41.,  nan,  nan,  nan,  nan],
#        [  1.,   2.,   3.,  nan,  nan],
#        [  2.,   3.,  nan,  nan,  nan],
#        [  5.,   6.,   7.,   8.,   9.]])

np.nanmin(arr, axis=1)  # array([-20., -23., -41.,   1.,   2.,   5.])
np.nanmax(arr, axis=1)  # array([-20., -23., -41.,   3.,   3.,   9.])
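
For what it's worth, the same padded array can also be built with itertools.zip_longest (izip_longest on Python 2) instead of manual padding:

from itertools import zip_longest

# zip_longest transposes the rows while filling the short ones with nan,
# so transpose back to recover the same (6, 5) float array as pad_array
arr = np.array(list(zip_longest(*data, fillvalue=np.nan))).T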

This isn't faster than a regular list comprehension, though. As for why np.min and np.max don't behave as you expect: they are working, but numpy has no support for ragged arrays, so np.array(data) builds a one-dimensional array of list objects. np.min then returns the smallest of those objects, exactly as Python's builtin min would, and the same goes for np.max.
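
Here's a quick sketch of what's going on (note that newer NumPy versions make you spell out dtype=object before they'll build a ragged array at all):

arr = np.array(data, dtype=object)
arr.shape    # (6,) -- one dimension, six list objects; there is no axis=1
np.min(arr)  # [-41] -- the lexicographically smallest *list*, same as min(data)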

Here are timings comparing building the padded array plus the reduction against a plain list comprehension:

%%timeit
arr = pad_array(data)  # pad_array already returns an ndarray
np.nanmin(arr, axis=1)
10000 loops, best of 3: 27 µs per loop

%timeit [min(row) for row in data]
1000000 loops, best of 3: 1.26 µs per loop

This comparison is a bit contrived, since pad_array itself uses a list comprehension and a generator expression, so it stands to reason that the single list comprehension wins. But even if you only had to build the padded array once and timed just the reduction, the single list comprehension would still be faster:

%timeit np.nanmin(arr, axis=1)
100000 loops, best of 3: 13.3 µs per loop

EDIT:

You could use np.vectorize to make vectorized versions of Python's builtin min and max functions:

vmax = np.vectorize(max)
vmin = np.vectorize(min)
vmax(data) # array([-20, -23, -41,   3,   3,   9])
vmin(data) # array([-20, -23, -41,   1,   2,   5])

It's still not faster than a list comprehension ...

%timeit vmax(data)
10000 loops, best of 3: 25.6 µs per loop

EDIT 2

For the sake of completeness/correctness, it is worth pointing out that the numpy solution scales better than the pure-Python list comprehension. If we had 6 million rows instead of 6 and needed to perform multiple element-wise operations, numpy would come out ahead. For example, if we have

data = [
  [-20],
  [-23],
  [-41],
  [1, 2, 3],
  [2, 3],
  [5, 6, 7, 8, 9],
] * 1000000

arr = pad_array(data)  # this takes ~6 seconds

The timings are now much more in favor of numpy:

%timeit [min(row) for row in data]
1 loops, best of 3: 1.05 s per loop

%timeit np.nanmin(arr, axis=1)
10 loops, best of 3: 111 ms per loop
JaminSore

Why not use a list comprehension?

>>> d
[[-20], [-23], [-41], [1, 2, 3], [2, 3], [5, 6, 7, 8, 9]]
>>> [max(sublist) for sublist in d]
[-20, -23, -41, 3, 3, 9]
>>> [min(sublist) for sublist in d]
[-20, -23, -41, 1, 2, 5]

This will also work for a numpy array:

>>> from numpy import array
>>> d
array([[-20], [-23], [-41], [1, 2, 3], [2, 3], [5, 6, 7, 8, 9]], dtype=object)
>>> [max(sublist) for sublist in d]
[-20, -23, -41, 3, 3, 9]

Of course, you can make the result an array:

>>> array([max(sublist) for sublist in d])
array([-20, -23, -41,   3,   3,   9])
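
If you need both the minima and the maxima, you can also collect them in a single pass:

>>> mins, maxes = zip(*((min(s), max(s)) for s in d))
>>> array(mins)
array([-20, -23, -41,   1,   2,   5])
>>> array(maxes)
array([-20, -23, -41,   3,   3,   9])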
timgeb
  • It will all come down to the benchmarks. I'd still like to know how to do it with numpy if that's possible. It would help me gain a better understanding of how the library works. – David Sanders Jun 30 '14 at 16:07