1

Given two arrays, arr1 and arr2, I want to extract from arr1, the first cap elements in the order it appears in arr2. So from the example below:

arr1 = np.array([3, 7, 1, 10, 2])
arr2 = np.array([0, 3, 1, 5, 2, 7, 6, 4])
cap = 3

From arr1, 3,7,1,2 appear in arr2 ordered as 3,1,2,7. I only want the first 3, so my desired outcome is array([3, 1, 2]).

I can get my desired outcome as follows:

ans = np.array([el for el in arr2 if el in arr1][:cap])

but this solution iterates over the entire arr2 when I only want the first 3 occurrences. Is there a more efficient solution?

  • 1
    Does this answer your question? [How to limit the size of a comprehension?](https://stackoverflow.com/questions/42393527/how-to-limit-the-size-of-a-comprehension) – Woodford Jun 23 '21 at 18:53
  • If you are strongly concerned about performance, you might want to revisit my revised answer and have a look at Tom's adjusted plots to see how it performs in comparison to the alternatives. Apparently, the amended for-loop outpaces both your original list-comprehension as well as `np.isin()` in most scenarios. – Georgy Kopshteyn Jun 24 '21 at 08:05

2 Answers2

1

Here's a straightforward iterative approach that stops after the cap has been reached:

lst=[]
for el in arr2:
    if el in arr1:
        lst.append(el)
        cap -=1
    if cap <1:
        break
ans = np.array(lst)
Georgy Kopshteyn
  • 678
  • 3
  • 13
0

Here is an attempt that doesn't use any iteration, and switches the membership checking to np.isin():

>>> arr2[np.isin(arr2, arr1[np.isin(arr1, arr2)])][:cap]
array([3, 1, 2])

Since this is arguably harder to read than your list comprehension, here are the steps broken down:

# find elements of arr1 in arr2
subset_arr1 = arr1[np.isin(arr1, arr2)] # array([3, 7, 1, 2])

# find the elements of arr2 in subset1, this gets the desired order
subset_arr2 = arr2[np.isin(arr2, subset_arr1)] # array([3, 1, 2, 7])

# take the first cap elements
ans = subset_arr2[:cap] # array([3, 1, 2])

Timing

My original testing indicated my np.isin() approach to be the fastest. However, Georgy since updated their answer (the for loop) with a bug fix. The new testing then indicated Georgy's answer to be fastest for all tested data sizes:

enter image description here

"np.isin" is my answer, "listcomp" is your original solution, "forloop" is the other existing answer. In the above, the size of N (x-axis) is the length of arr1, arr2 is double that length, and the cap is fixed at 5 (I thought this seemed reasonable, but see below for tweaks).

It seems the ability to stop early by the for loop can be very beneficial! In my answer, the code has to check all elements of arr1 for membership in arr2 (subset_arr1 above), and then all elements of arr2 for membership in that subset. Georgy's loop OTOH only has to check for membership of some of the elements of arr2 in arr1. So even though my looping should be faster, there can be much more of it!

The only parameter I tried fiddling with was the cap, which shows how smaller cap sizes are better for the for loop (smaller cap -> less iteration). The above was cap == 5; with cap == 100:

enter image description here

And cap == 1000:

enter image description here

I reckon the other parameters could influence things as well (e.g. how likely it is for the is-in operation to return true AKA the range of numerical values, or the relative sizes of the two arrays). I have made the perfplot code below more parameterized if anyone wants to test - I'm tired of thinking about complexity :

import numpy as np
import perfplot

cap = 5
minnumber = 1
maxnumber = 10
factor1 = 1 # how size of arr1 relates to n
factor2 = 2 # how size of arr2 relates to n
low_n = 3
high_n = 18

def make_data(n):
    arr1 = np.random.randint(minnumber, maxnumber, n*factor1)
    arr2 = np.random.randint(minnumber, maxnumber, n*factor2)
    return arr1, arr2

def listcomp(arr1, arr2):
    return np.array([el for el in arr2 if el in arr1][:cap])

def npisin(arr1, arr2):
    return arr2[np.isin(arr2, arr1[np.isin(arr1, arr2)])][:cap]

def forloop(arr1, arr2):
    capcopy = cap
    lst=[]
    for el in arr2:
        if el in arr1:
            lst.append(el)
            capcopy -=1
        if capcopy <1:
            break
    return np.array(lst)

perfplot.show(
    setup=lambda n : make_data(n),
    kernels = [
        lambda x: listcomp(*x),
        lambda x: npisin(*x),
        lambda x: forloop(*x),
        ],
    labels=['listcomp', 'np.isin', 'forloop'],
    n_range=[2 ** k for k in range(low_n, high_n)])
Tom
  • 8,310
  • 2
  • 16
  • 36
  • Great suggestion! I've slightly updated my for-loop, because in the old version the iteration did not actually stop after the cap has been reached but later matches have simply not been included. Would be interesting to know how this changes things. – Georgy Kopshteyn Jun 23 '21 at 19:29
  • @GeorgyKopshteyn your change is definitely warranted, and appears to be faster than mine for many cases! +1. See my thoughts above – Tom Jun 23 '21 at 20:09