
I have one pretty large np.array a (10,000-50,000 elements, each an (x, y) coordinate pair) and another, larger np.array b (100,000-200,000 coordinates). I need to remove, as quickly as possible, the elements of a that are not present in b, leaving only the elements of a that are also present in b. All coordinates are integers. For example:

a = np.array([[2,5],[6,3],[4,2],[1,4]])
b = np.array([[2,7],[4,2],[1,5],[6,3]])

Desired output:

a

>> [6,3],[4,2]

What is the fastest way of doing this for arrays of the size I mentioned?

I am OK with solutions that use any other packages or imports too (e.g., converting to a base Python list or set, using Pandas, etc.) besides those within Numpy.

  • since pandas is tagged, it's a merge: `pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()` – anky Apr 15 '21 at 16:51
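For reference, that merge one-liner applied to the example arrays above would look like this (a minimal sketch; `merge()` defaults to an inner join on the shared column names, so only rows present in both frames survive):

import numpy as np
import pandas as pd

a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])

# inner join on columns 0 and 1 keeps only the rows common to both
result = pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()
print(result)
# [[6 3]
#  [4 2]]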

2 Answers


This appears to depend a lot on the array size and "sparseness" (likely due to hash table magic).

The so_8317022 function below is the answer from Get intersecting rows across two 2D numpy arrays.

The takeaways seem to be (on my machine) that:

  • the Pandas approach has an edge with large sparse sets
  • set intersection is very, very fast with small array sizes (though admittedly it returns a set, not a numpy array – see the snippet after this list)
  • the structured-dtype Numpy answer (so_8317022) is competitive with the Pandas merge at small sizes, but in these runs it never catches set intersection.
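As noted above, the set approach returns a set of tuples rather than an array; converting back is a one-liner if needed (not included in the timings below), e.g.:

result = set_intersection(a, b)    # set of (x, y) tuples
result = np.array(list(result))    # back to an (n, 2) integer array

The benchmark code: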
from collections import defaultdict
import timeit

import numpy as np
import pandas as pd


def pandas_merge(a, b):
    # inner merge on the shared column names keeps only rows present in both
    return pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()


def set_intersection(a, b):
    # note: returns a set of (x, y) tuples, not a numpy array
    return set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))


def so_8317022(a, b):
    # view each (x, y) row as a single structured scalar so that
    # np.intersect1d can compare whole rows at once
    nrows, ncols = a.shape
    dtype = {
        "names": ["f{}".format(i) for i in range(ncols)],
        "formats": ncols * [a.dtype],
    }
    C = np.intersect1d(a.view(dtype), b.view(dtype))
    return C.view(a.dtype).reshape(-1, ncols)


def test_fn(f, a, b):
    # timeit's autorange() picks the repeat count; report calls per second
    number, time_taken = timeit.Timer(lambda: f(a, b)).autorange()
    return number / time_taken


def test(size, max_coord):
    # max_coord controls "sparseness": a smaller range of possible
    # coordinates means more duplicates and more matches
    a = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    b = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    return {fn.__name__: test_fn(fn, a, b) for fn in (pandas_merge, set_intersection, so_8317022)}


series = []
datas = defaultdict(list)

for size in (100, 1000, 10000, 100000):
    for max_coord in (50, 500, 5000):
        print(size, max_coord)
        series.append((size, max_coord))
        for fn, result in test(size, max_coord).items():
            datas[fn].append(result)

print("size", "sparseness", "func", "ops/sec")
for fn, values in datas.items():
    for (size, max_coord), value in zip(series, values):
        print(size, max_coord, fn, int(value))

The results on my machine are

size sparseness func ops/sec
100 50 pandas_merge 895
100 500 pandas_merge 777
100 5000 pandas_merge 708
1000 50 pandas_merge 740
1000 500 pandas_merge 751
1000 5000 pandas_merge 660
10000 50 pandas_merge 513
10000 500 pandas_merge 460
10000 5000 pandas_merge 436
100000 50 pandas_merge 11
100000 500 pandas_merge 61
100000 5000 pandas_merge 49
100 50 set_intersection 42281
100 500 set_intersection 44050
100 5000 set_intersection 43584
1000 50 set_intersection 3693
1000 500 set_intersection 3234
1000 5000 set_intersection 3900
10000 50 set_intersection 453
10000 500 set_intersection 287
10000 5000 set_intersection 300
100000 50 set_intersection 47
100000 500 set_intersection 13
100000 5000 set_intersection 13
100 50 so_8317022 8927
100 500 so_8317022 9736
100 5000 so_8317022 7843
1000 50 so_8317022 698
1000 500 so_8317022 746
1000 5000 so_8317022 765
10000 50 so_8317022 89
10000 500 so_8317022 48
10000 5000 so_8317022 57
100000 50 so_8317022 10
100000 500 so_8317022 3
100000 5000 so_8317022 3
– AKX

Not sure if this is the fastest way to do it, but if you turn the coordinates into a Pandas index you can use its intersection method. Since that runs low-level C code under the hood, the intersection step is probably pretty fast, though converting to a Pandas index may take some time.

import numpy as np
import pandas as pd

a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])

df_a = pd.DataFrame(a).set_index([0, 1])
df_b = pd.DataFrame(b).set_index([0, 1])
intersection = df_a.index.intersection(df_b.index)

The result looks like this:

print(intersection.values)
[(6, 3) (4, 2)]
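The index holds (x, y) tuples; if you need a regular (n, 2) array again, one way is, e.g.:

# convert the index of (x, y) tuples back to an (n, 2) integer array
result = np.array(intersection.tolist())
print(result)
# [[6 3]
#  [4 2]]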

EDIT:

Out of curiosity I made a comparison between the methods, now with a larger list of coordinates. I compared my first index-based method with a slightly improved version that does not require creating a DataFrame first but builds the index directly, and then with the DataFrame merge method proposed in the comments as well.

This is the code:

from random import randint, seed
import time
import numpy as np
import pandas as pd

seed(0)

n_tuple = 100000
i_min = 0
i_max = 10
a = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
b = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
np_a = np.array(a)
np_b = np.array(b)


def method0(a_array, b_array):
    # build an index via a throwaway DataFrame, then intersect
    index_a = pd.DataFrame(a_array).set_index([0, 1]).index
    index_b = pd.DataFrame(b_array).set_index([0, 1]).index
    return index_a.intersection(index_b).to_numpy()


def method1(a_array, b_array):
    # build the MultiIndex directly from the transposed coordinate columns
    index_a = pd.MultiIndex.from_arrays(a_array.T)
    index_b = pd.MultiIndex.from_arrays(b_array.T)
    return index_a.intersection(index_b).to_numpy()


def method2(a_array, b_array):
    # inner merge on both columns, as suggested in the question's comments
    df_a = pd.DataFrame(a_array)
    df_b = pd.DataFrame(b_array)
    return df_a.merge(df_b).to_numpy()


def method3(a_array, b_array):
    # plain Python sets of (x, y) tuples (returns a set, not an array)
    set_a = {(row[0], row[1]) for row in a_array}
    set_b = {(row[0], row[1]) for row in b_array}
    return set_a.intersection(set_b)


for cnt, intersect in enumerate([method0, method1, method2, method3]):
    t0 = time.time()
    if cnt < 3:
        intersection = intersect(np_a, np_b)
    else:
        # method3 gets the plain lists; the others get the numpy arrays
        intersection = intersect(a, b)
    print(f"method{cnt}: {time.time() - t0}")

The output looks like:

method0: 0.1439347267150879
method1: 0.14012742042541504
method2: 4.740894317626953
method3: 0.05933070182800293

Conclusion: the merge method on DataFrames (method2) is roughly 30 times slower than using intersections on the index (4.74 s vs 0.14 s here). The MultiIndex-based version (method1) is only slightly faster than method0 (my first proposal).

EDIT2: As proposed in the comment by @AKX: if you use plain lists and sets instead of numpy arrays (method3), you can gain another speed-up of roughly a factor of 2-3. But it is clear you should not use the merge method.

– Eelco van Vliet
    I'm pretty sure you don't need Pandas for this... Turning the data into a set of 2-tuples and using set intersection would also use C code. – AKX Apr 15 '21 at 17:07
  • Yep: using a pure set of tuples is the fastest method. However, if you use the numpy arrays as the input arguments it is almost as fast, so it really depends on whether you need numpy arrays or not. – Eelco van Vliet Apr 15 '21 at 18:07