3

I have the following code that does exactly what I want, but it is too slow, since it involves an unnecessary materialization step:

### init
a = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

### condition 1) element 0 has to be larger than 1
### condition 2) limit the output to 2 elements
b = a[a[:,0] > 1][:2]

The problem is that this is very slow when I have a big array (given that I only want to cut off a tiny piece with condition 2). It can easily be done in a few steps, but I have not found a way of putting it into a one-liner.

Hence, is there a neat way of doing this efficiently in a one-liner? Something like this:

b = a[a[:,0] > 1 and :2]

Thank you!

m3o
  • You should explain what the problem is: whether it is the code itself or the math behind numpy. – peluzza Dec 19 '13 at 23:20
  • this question is better suited for codereview.stackexchange I think ... but I don't think you can ... – Joran Beasley Dec 19 '13 at 23:29
  • So in effect, you want the operation to short circuit after `n` values have been found. Is that right? – senderle Dec 19 '13 at 23:37
  • Yes, that is correct! – m3o Dec 19 '13 at 23:38
  • Why don't you simply reverse the order of the indexings? By doing `a[:, :2][a[:, 0] > 1]` you will not "materialize" the big array, but just one with the elements you are after. – Jaime Dec 19 '13 at 23:39
  • Unfortunately that does not work, since you potentially have elements that you would remove in your 2nd step. I need `n` elements if there are `n` elements. – m3o Dec 19 '13 at 23:41
  • It seems unlikely that there's a super-clean way of doing this right now. In fact, a short-circuiting `find` for numpy doesn't yet exist; it's a feature request for the 2.0 release according to the first answer to [this question](http://stackoverflow.com/questions/7632963/numpy-array-how-to-find-index-of-first-occurrence-of-item). If they don't even have that, I'm not sure there's a good way of doing this without using vanilla python or coding your own extension. – senderle Dec 19 '13 at 23:45
  • @marco I see, I misunderstood: I thought you wanted to keep only the first two columns of every row, not just the first two rows... – Jaime Dec 19 '13 at 23:55
  • @Jaime - You actually gave me an idea - I can take advantage of a slight twist in my setup. There are no more than `n` elements that I want to remove with *condition 1*. Hence I can do this: `tmp_a = a[:n*2]; return tmp_a[tmp_a[:,0] > 1][:n]`. Since `#a >> n` this will give me a very nice speedup. But since I asked the question in a more generic manner this would not be the 100% correct answer. – m3o Dec 20 '13 at 09:37

3 Answers

2

You might be able to speed this code up a little. Your current code works like this:

a = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])

# Check your condition
mask = a[:, 0] > 1

# Copy the rows of the array that satisfy the condition
temp = a[mask]

# Take first two rows of temp
b = temp[:2]

I suspect the most expensive operation is the copy in the middle. You could try to avoid it by doing something like this:

mask = a[:, 0] > 1

# Find the first two True values in mask
where = np.where(mask)[0][:2]

# Only copy the rows you really want
b = a[where]

There may be a more efficient way to find the first two True values; I haven't given it that much thought, but the key is to find the values you want first and then only copy those rows.
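
For what it's worth, here is a minimal sketch of the same idea written with `np.flatnonzero`, which is just shorthand for `np.where(mask)[0]`; it behaves the same (it still scans the whole mask rather than short-circuiting), but reads a little more directly:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# row indices of the first two rows whose first element is > 1
where = np.flatnonzero(a[:, 0] > 1)[:2]

# only copy the rows you really want
b = a[where]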

Bi Rico
2

I can't think of a faster solution in straight numpy, but you can probably do a little better using numba:

from numba import autojit

def filtfunc(a):
    # collect the indices of rows whose first element is > 1,
    # short-circuiting as soon as two have been found
    idx = []
    for ii in range(a.shape[0]):
        if (a[ii, 0] > 1):
            idx.append(ii)
            if (len(idx) == 2):
                break
    return a[idx]

jit_filter = autojit(filtfunc)
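
(Side note: `autojit` has since been removed from numba; on recent releases the rough equivalent would presumably be `jit_filter = numba.njit(filtfunc)`, though I haven't re-run the timings with that variant.)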

For reference, here are the other two proposed solutions:

def marco_filter(a):
    return a[a[:,0] > 1][:2]

def rico_filter(a):
    mask = a[:, 0] > 1
    where = np.where(mask)[0][:2]
    return a[where]

Some timings:

%%timeit a = np.random.random_integers(1, 12, (1000,1000))
marco_filter(a)
# 100 loops, best of 3: 11.6 ms per loop

%%timeit a = np.random.random_integers(1, 12, (1000,1000))
rico_filter(a)
# 10000 loops, best of 3: 44.8 µs per loop

%%timeit a = np.random.random_integers(1, 12, (1000,1000))
jit_filter(a)
# 10000 loops, best of 3: 30.7 µs per loop
ali_m
-3

If n=2 is quite small compared to a.shape[0], it might be worthwhile to use this little function. The basic idea is to calculate a mask that is just large enough to give the desired number of final rows. Here I do it iteratively. Normally iteration is slow, but if the number of iterations is small enough, the time saved elsewhere could make it worth it.

def mask(a):
    return a[:,0]>1

def paul_filter1(a,n):
    # incremental w/ sum
    j = a.shape[0]
    for i in xrange(n,j+1):
        am = mask(a[:i,:])
        if np.sum(am)>=n:
            j = i
            break
    return a[am,:]

Note that the mask `am` can be shorter than the dimension it is indexing; effectively the rest is padded with False. I haven't checked whether this is documented.
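
(On newer NumPy releases a boolean index that is shorter than the axis it indexes raises an `IndexError` instead of being padded, so a more portable variant would presumably return `a[np.where(am)[0], :]` rather than relying on that behaviour.)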

In this small example, `paul_filter1` is 3x slower than `a[a[:,0]>1,:][:2,:]`.

But with a larger array, say `a2 = np.tile(a,[1000,1])`, the time for `paul_filter1` stays about the same, while the 'brute force' version keeps slowing down as it has to apply the mask to more rows. Of course these timings depend on where in `a` the desired rows are located. There won't be any savings if `paul_filter1` has to scan nearly all of the rows.

edit

Addressing Bi Rico's concern about the repeated `np.sum` (even though that is fast compiled code), we can build the `where` list incrementally:

def paul_filter3(a,n):
    # incremental adding index
    j = a.shape[0]
    am = mask(a[:n,:])
    am = np.where(am)[0].tolist()
    if len(am)<n:
        for i in xrange(n,j):
            if mask(a[[i],:]):
                am.append(i)
                if len(am)>=n:
                    break
    am = np.array(am)
    return a[am,:]

For small n this is even faster.

Something closer to the original method is to calculate the full mask but then trim it; `cumsum` can be used to find the minimum length.

def paul_filter4(a,n):
    # cumsum
    am = mask(a)
    j = np.cumsum(am).searchsorted(n)
    return a[am[:j+1],:]
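
To illustrate the mechanics on a toy mask (made up purely for the example), with n=2:

import numpy as np

am = np.array([False, True, False, True, True])
np.cumsum(am)                  # array([0, 1, 1, 2, 3])
np.cumsum(am).searchsorted(2)  # 3, so the first 4 rows already contain 2 matches

so `a[am[:4],:]` picks out exactly the two wanted rows.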

Tested with the 1000x1000 random array of integers (1:12), the times are as follows (using n=20 instead of 2, and tweaking the mask so that more rows are False):

In [172]: timeit paul_filter4(a,20)
1000 loops, best of 3: 690 us per loop

In [173]: timeit paul_filter3(a,20)
1000 loops, best of 3: 1.22 ms per loop

In [175]: timeit paul_filter1(a,20)
1000 loops, best of 3: 994 us per loop

In [176]: timeit rico_filter(a,20)
1000 loops, best of 3: 668 us per loop

In [177]: timeit marco_filter(a,20)
10 loops, best of 3: 21 ms per loop  

`rico_filter` using `where` is fastest, but my alternative using `cumsum` isn't far behind. The incremental filters are similar in speed, roughly half the speed of the fastest ones.

In `a` as generated and tested here, most rows are True. This is consistent with marco's point that the rows kept by the limit condition are a small subset of those passing the logical condition. Under these conditions, Bi Rico's concern that `paul_filter1` could blow up is unrealistic.

If I change the testing parameters so that all of the rows of `a` have to be tested (`a[:,0]>11`), then the filters using `where` and `cumsum` take just as long as the original. The incremental filters are slower, by a factor of 15 or more. But my first attempt using `np.sum` is the fastest of that style.

hpaulj
  • While your basic idea is correct here, by using `am = am[:i, 0] > 1; if np.sum(am) == n ...` you've turned this into an O(N**2) approach. Even if n is small, this could blow up if `a[:, 0]` has a lot of small values. You could fix this by doing something similar to @ali_m's answer. The key is to avoid O(n) operations, like `sum`, inside your loop. – Bi Rico Dec 20 '13 at 20:24
  • I've tried some alternative incremental functions that don't use the `sum` in the loop, and don't get any speed up. `np.sum` is compiled code, so even if it is O(n) it doesn't slow down the loop. Any incremental approach will be slow if it ends up testing most of the rows. – hpaulj Dec 24 '13 at 07:46