I need to filter an array to remove the elements that are lower than a certain threshold. My current code is like this:

import numpy

threshold = 5
a = numpy.array(range(10))  # testing data
b = numpy.array(list(filter(lambda x: x >= threshold, a)))

The problem is that this builds a temporary list by calling a lambda function on every element through filter, which is slow.

As this is quite a simple operation, maybe there is a numpy function that does it in an efficient way, but I've been unable to find it.

Another approach would be to sort the array, find the index of the threshold, and return a slice from that index onwards. Even if this were faster for small inputs (where the difference wouldn't be noticeable anyway), it is definitely asymptotically less efficient as the input grows: sorting is O(n log n), while a single filtering pass is O(n).

Update: I took some measurements too. Sorting + slicing was still twice as fast as the pure Python filter on an input of 100,000,000 entries, and boolean indexing was faster than both:

r = numpy.random.uniform(0, 1, 100000000)

%timeit test1(r) # filter
# 1 loops, best of 3: 21.3 s per loop

%timeit test2(r) # sort and slice
# 1 loops, best of 3: 11.1 s per loop

%timeit test3(r) # boolean indexing
# 1 loops, best of 3: 1.26 s per loop
cottontail
fortran
  • 2
    yeah, it's quite nice :-) it even calculates automatically how many iterations it should perform to average the measurements if the code takes very little time to execute – fortran Nov 03 '11 at 15:32
  • 5
    @yosukesabai - IPython's `%timeit` uses the builtin `timeit` module. Have a look at it, as well. http://docs.python.org/library/timeit.html – Joe Kington Nov 03 '11 at 16:04

2 Answers

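b = a[a >= threshold]

This should do it: indexing with the boolean array a >= threshold keeps only the elements where the comparison is True.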

I tested as follows:

import datetime
import numpy as np

# 1,000,000 zeros followed by 1,000,000 ones
lrg = np.arange(2).reshape((2,-1)).repeat(1000000,-1).flatten()

# boolean indexing
t0 = datetime.datetime.now()
flt = lrg[lrg==0]
print(datetime.datetime.now() - t0)

# pure Python filter (wrapped in list() so it also works on Python 3)
t0 = datetime.datetime.now()
flt = np.array(list(filter(lambda x:x==0, lrg)))
print(datetime.datetime.now() - t0)

I got

$ python test.py
0:00:00.028000
0:00:02.461000

http://docs.scipy.org/doc/numpy/user/basics.indexing.html#boolean-or-mask-index-arrays
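
To make the mechanism explicit, here is a small sketch that splits the one-liner into its two steps. The intermediate name `mask` is mine, introduced only for illustration:

```python
import numpy as np

a = np.arange(10)
threshold = 5

# Step 1: the element-wise comparison produces a boolean array of the same shape.
mask = a >= threshold

# Step 2: indexing with that boolean array keeps the elements where it is True.
b = a[mask]

print(mask)
print(b)
```

The result is a new array; `a` itself is left unchanged.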

yosukesabai
  • 1
    added test result, not just what I think it should do. :p – yosukesabai Nov 03 '11 at 12:38
  • 3
    This kind of indexing does not preserve the size of the array. How can I keep the same number of elements and zero out the sub-threshold values instead? – linello Jul 24 '13 at 10:00
  • 9
    @linello, `a[a<=threshold] = 0` will zero out the elements that do not exceed the threshold – yosukesabai Aug 17 '13 at 19:25
  • 4
    I ran into the issue of filtering based on two criteria. Here is the solution: http://stackoverflow.com/a/3248599/1373468 – Robin Newhouse Jan 12 '14 at 03:29
  • @yosukesabai Is it possible to do exactly this without actually changing the original values? If `np.ma` is meant to do that, I cannot figure out how. – embert Jan 24 '14 at 11:04
  • @embert, not sure what you mean by "changing original values". The array `lrg` is not changed, and `flt` has all of the values that I wanted (only the zeros) – yosukesabai Jan 25 '14 at 06:11
  • @yosukesabai Was referring to your comment `a[a<=threshold] = 0`. As I figured now, the same can be achieved using `np.ma.masked_where(a<=threshold, a)`, without zeroing but masking those values which are `<=threshold` – embert Jan 25 '14 at 08:48
  • What should we do if we have a float array and want to compare it against a float threshold? – Gusev Slava Mar 11 '17 at 15:47
  • @GusevSlava, I don't think it makes any difference. You can still use `b = a[a>threshold]`, where `a` is an array of floats and `threshold` is a float – yosukesabai Mar 11 '17 at 16:57
  • @yosukesabai For example, for a comparison such as `0.1 == 0.1` we can't rely on the result because of floating-point rounding, so we have to use special numpy functions such as `isclose`. As a result, we can't compare floats the way `a > b` does, if I understand correctly. – Gusev Slava Mar 11 '17 at 17:02
  • I was thinking of the `>` comparison only. If you use isclose(), it is going to be `b = a[isclose(a, target)]`. – yosukesabai Mar 11 '17 at 17:26
  • @GusevSlava basically, all you need is for the expression in the brackets to give True/False for each element. – yosukesabai Mar 11 '17 at 17:32
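
The variations raised in the comments above can be sketched together as follows. This is only an illustration: the sample values and the names `zeroed`, `masked`, `in_range`, `c` and `near` are made up for the example.

```python
import numpy as np

a = np.array([3.0, 7.0, 2.0, 6.0, 1.0])
threshold = 5.0

# Keep the size: zero the sub-threshold values on a copy, so `a` stays untouched.
zeroed = a.copy()
zeroed[zeroed <= threshold] = 0

# Mask instead of zeroing: the data is kept, but masked entries are
# skipped by reductions such as sum().
masked = np.ma.masked_where(a <= threshold, a)

# Two criteria: combine the masks with & and parenthesize each comparison.
in_range = a[(a > 1.5) & (a < 6.5)]

# Approximate equality on floats: np.isclose also returns a boolean mask.
c = np.array([0.1 + 0.2, 0.3, 0.5])
near = c[np.isclose(c, 0.3)]

print(zeroed, masked, in_range, near)
```

In every case the expression inside the brackets is just a boolean array with one True/False per element, which is all that boolean indexing needs.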

You can also use np.where to get the indices where the condition is True and use advanced indexing.

import numpy as np
b = a[np.where(a >= threshold)]
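
As a small sketch of what happens in that line (the sample array is made up for illustration): with a single argument, np.where returns a tuple holding one index array per dimension, and indexing with that tuple selects the same elements as the boolean mask.

```python
import numpy as np

a = np.array([3, 7, 2, 6, 1])
threshold = 5

# One-argument form: a tuple with one array of indices per dimension.
idx = np.where(a >= threshold)
print(idx[0])  # positions where the condition holds

# Indexing with those positions gives the same result as boolean indexing.
b = a[idx]
print(np.array_equal(b, a[a >= threshold]))
```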

One useful feature of np.where is that it can also replace values (e.g. the values that fail the threshold test). While a[a < 5] = 0 modifies a in place, the three-argument form of np.where returns a new array of the same shape in which only the values that fail the condition are replaced.

a = np.array([3, 7, 2, 6, 1])
b = np.where(a >= 5, a, 0)       # array([0, 7, 0, 6, 0])

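It's also very competitive in terms of performance: in the timings below, the difference from plain boolean indexing is within the run-to-run variation.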

a, threshold = np.random.uniform(0,1,100000000), 0.5

%timeit a[a >= threshold]
# 1.22 s ± 92.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit a[np.where(a >= threshold)]
# 1.34 s ± 258 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
cottontail