Intersection of two arrays, retaining order in larger array

Question

I have a numpy array a of length n, which has the numbers 0 through n-1 shuffled in some way. I also have a numpy array mask of length <= n, containing some subset of the elements of a, in a different order.

The query I want to compute is "give me the elements of a that are also in mask in the order that they appear in a".
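As a small illustration of the desired result (the concrete values below are made up for this example and are not from my benchmark), `np.isin` builds the membership mask, and indexing `a` with it keeps the order of `a`:

```python
import numpy as np

a = np.array([3, 0, 4, 1, 2])   # a permutation of 0..4 (illustrative values)
mask = np.array([2, 3, 0])      # a subset of a, in a different order

# Keep the elements of a that occur in mask, in the order they appear in a.
result = a[np.isin(a, mask)]
print(result.tolist())  # [3, 0, 2]
```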

I had a similar question here, but the difference was that mask was a boolean mask instead of a mask on the individual elements.

I've outlined and tested 4 methods below:

import timeit
import numpy as np
import matplotlib.pyplot as plt

n_test = 100
n_coverages = 10

np.random.seed(0)


def method1():
    return np.array([x for x in a if x in mask])


def method2():
    s = set(mask)
    return np.array([x for x in a if x in s])


def method3():
    return a[np.in1d(a, mask, assume_unique=True)]


def method4():
    bmask = np.full((n_samples,), False)
    bmask[mask] = True
    return a[bmask[a]]


methods = [
    ('naive membership', method1),
    ('python set', method2),
    ('in1d', method3),
    ('binary mask', method4)
]

p_space = np.linspace(0, 1, n_coverages)
for n_samples in [1000]:
    a = np.arange(n_samples)
    np.random.shuffle(a)

    for label, method in methods:
        if method == method1 and n_samples == 10000:
            continue
        times = []
        for coverage in p_space:
            mask = np.random.choice(a, size=int(n_samples * coverage), replace=False)
            time = timeit.timeit(method, number=n_test)
            times.append(time * 1e3)
        plt.plot(p_space, times, label=label)
    plt.xlabel(r'Coverage ($\frac{|\mathrm{mask}|}{|\mathrm{a}|}$)')
    plt.ylabel('Time (ms)')
    plt.title('Comparison of 1-D Intersection Methods for $n = {}$ samples'.format(n_samples))
    plt.legend()
    plt.show()

Which produced the following results:

So the binary mask is, without a doubt, the fastest of these 4 methods for any mask size.

My question is, is there a faster way?

Good question. I was playing with your code with simple test case, a = np.array([10,15,30,20,18,29]), n_samples = len(a), mask = np.array([20,18]), Method 1 works. method4 (binary mask) gives me an error "IndexError: index 20 is out of bounds for axis 1 with size 6". — plasmon360, Mar 24 '17 at 01:36
@user1753919 The code I wrote for `method4` is only supposed to work if `a` is length `n` and **only contains numbers between 0 and n-1**, with no duplicates. — michaelsnowden, Mar 24 '17 at 01:39
For sake of variety, a method that is faster than `in1d` but slower than `binMask` can be found [here](http://stackoverflow.com/a/15940459/5992438) using the `pandas.Index.get_indexer` method. Something like `a[pd.Index(mask).get_indexer(a) >= 0]` should work. — bunji, Mar 24 '17 at 12:27

Ivan Gritsenko · Answer 1 · 2017-03-26T01:29:08.937

So the binary mask is, without a doubt, the fastest of these 4 methods for any mask size.

My question is, is there a faster way?

I totally agree that the binary mask method is the fastest one. I also don't think there is a better way, in terms of computational complexity, to do what you need.

Let me analyse the timing results for each of your methods:

Method 1 (naive membership) runs in T = O(|a|*|mask|) time. Each element of a is checked for membership in mask by scanning mask element by element, which takes O(|mask|) time per element in the worst case, when the element is missing from mask. |a| does not change in your test, so consider it a constant.
|mask| = coverage * |a|
T = O(|a|² * coverage)
Hence the linear dependency on coverage in the plot. Note that the running time depends quadratically on |a|. If |mask| ≤ |a| and |a| = n then T = O(n²).

Method 2 (python set) uses a set. A set is a data structure that performs insertion/lookup in O(log(n)), where n is the number of elements in the set (but see the EDIT at the end of this answer). s = set(mask) takes O(|mask|*log(|mask|)) to complete because there are |mask| insertion operations.

x in s is a lookup operation, so the second line of the method runs in O(|a|*log(|mask|)).

Overall time complexity is O(|mask|*log(|mask|) + |a|*log(|mask|)). If |mask| ≤ |a| and |a| = n then T = O(n*log(n)). You probably observe an f(x) = log(x) dependency in the plot.

Method 3 (in1d) runs in O(|mask|*log(|mask|) + |a|*log(|mask|)) as well. Same T = O(n*log(n)) complexity and f(x) = log(x) dependency in the plot.

Method 4 (binary mask) has time complexity O(|a| + |mask|), which is T = O(n), and it's the best. You observe a constant dependency in the plot. The algorithm simply iterates over the a and mask arrays a couple of times.

The thing is that if you have to output n items, you already have T = O(n) complexity. So the method 4 algorithm is optimal.

P.S. To observe the f(n) dependencies mentioned above, you'd do better to vary |a| and keep |mask| = 0.9*|a|.
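A rough sketch of such an experiment, assuming only NumPy and the standard library: it varies |a| while keeping |mask| = 0.9*|a|, and times two of the methods from the question. The function names, the chosen sizes and the repeat count are illustrative choices, not taken from the original benchmark, and the absolute timings will differ between machines.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

def python_set(a, mask):
    # Method 2 from the question: hash-set membership in a Python loop.
    s = set(mask.tolist())
    return np.array([x for x in a.tolist() if x in s])

def binary_mask(a, mask):
    # Method 4 from the question: requires a to be a permutation of 0..n-1.
    bmask = np.zeros(len(a), dtype=bool)
    bmask[mask] = True
    return a[bmask[a]]

sizes = [1_000, 10_000, 100_000]   # illustrative sizes
timings = {}
for n in sizes:
    a = rng.permutation(n)
    mask = rng.choice(a, size=int(0.9 * n), replace=False)
    # Sanity check: both methods must agree before they are timed.
    assert np.array_equal(python_set(a, mask), binary_mask(a, mask))
    for name, fn in [("python set", python_set), ("binary mask", binary_mask)]:
        t = timeit.timeit(lambda: fn(a, mask), number=5) / 5
        timings[(name, n)] = t
        print(f"{name:12s} n={n:>7d}  {t * 1e3:8.3f} ms")
```

Plotting the timings against n on a log-log scale should make the growth rate of each method easier to read than the coverage plot does.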

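EDIT: As pointed out in the comments, Python's set is implemented as a hash table, so lookup and insertion are O(1) on average rather than O(log(n)). Method 2 is therefore O(|a| + |mask|) on average, like method 4, only with a much larger constant factor.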

Thanks for the answer. I don't think it should necessarily be O(n). In a perfect world, it should be O(|mask|) because that's the size of what I'm outputting. Also, CPython's set implementation uses hashing, so membership checks are O(1), not O(log |mask|). — michaelsnowden, Mar 24 '17 at 05:18
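One way to get closer to that bound, sketched here under an assumption that is not stated in the question: if `a` stays fixed across many queries, the position of every value in `a` can be precomputed once in O(n), and each query then only has to sort the mask by position, which costs O(|mask| log |mask|). The names `pos` and `ordered_subset` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
a = rng.permutation(n)   # permutation of 0..n-1, as in the question

# One-time O(n) preprocessing: pos[v] is the index at which value v sits in a.
pos = np.empty(n, dtype=np.intp)
pos[a] = np.arange(n)

def ordered_subset(mask):
    # Sort the mask values by their position in a: O(|mask| log |mask|) per query.
    return mask[np.argsort(pos[mask])]

mask = rng.choice(a, size=4, replace=False)
expected = a[np.isin(a, mask)]
assert np.array_equal(ordered_subset(mask), expected)
print(ordered_subset(mask))
```

For a single query this does not help, because building `pos` already touches all of `a`. It only pays off when the same `a` is queried repeatedly with small masks.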

piRSquared · Answer 2 · 2017-03-24T01:32:18.957

Assuming a is the bigger one.

def with_searchsorted(a, b):

    sb = b.argsort()
    bs = b[sb]

    sa = a.argsort()
    ia = np.arange(len(a))
    ra = np.empty_like(sa)
    ra[sa] = ia

    ac = bs.searchsorted(ia) % b.size

    return a[(bs[ac] == ia)[ra]]

demo

a = np.arange(10)
np.random.shuffle(a)
b = np.random.choice(a, 5, False)

print(a)
print(b)

[7 2 9 3 0 4 8 5 6 1]
[0 8 5 4 6]

print(with_searchsorted(a, b))

[0 4 8 5 6]

how it works

# sort b for faster searchsorting
sb = b.argsort()
bs = b[sb]

# sort a for faster searchsorting
sa = a.argsort()
# this is the sorted a... we just cheat because we know what it will be
ia = np.arange(len(a))

# construct the reverse sort look up
ra = np.empty_like(sa)
ra[sa] = ia

# perform searchsort
ac = bs.searchsorted(ia) % b.size

return a[(bs[ac] == ia)[ra]]
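The walk-through above relies on a being a permutation of 0..n-1, which is why ia can stand in for the sorted a. A minimal sketch of the same searchsorted idea for arbitrary sortable values follows. It is a variant written for illustration, not the function above, and the name `searchsorted_general` is hypothetical.

```python
import numpy as np

def searchsorted_general(a, b):
    # Hypothetical variant: does not assume a is a permutation of 0..n-1.
    if b.size == 0:
        return a[:0]
    bs = np.sort(b)
    # For each element of a, find where it would be inserted into sorted b.
    idx = np.searchsorted(bs, a)
    # Values larger than everything in b get index len(bs); point them at a
    # valid slot instead, since the equality test below rejects them anyway.
    idx[idx == bs.size] = 0
    return a[bs[idx] == a]

a = np.array([2.5, -1.0, 7.25, 0.5, 3.0])
b = np.array([3.0, 2.5, 0.5])
print(searchsorted_general(a, b).tolist())  # [2.5, 0.5, 3.0]
```

Like in1d, this sorts b and therefore costs O(|b| log |b| + |a| log |b|), so for the permutation case the binary mask should still be faster.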

edited Mar 24 '17 at 01:32

answered Mar 24 '17 at 01:26


Binary mask method will be faster. – Ivan Gritsenko Mar 24 '17 at 01:30
Your code seems to produce the exact same run time as `in1d`, which makes sense because they're both sorting under the hood. – michaelsnowden Mar 24 '17 at 01:54
This also works with other data types. Would binary mask as well? – piRSquared Mar 24 '17 at 02:00
@michaelsnowden fun problem btw – piRSquared Mar 24 '17 at 02:01
