
I have one pretty large np.array a (10,000-50,000 elements, each an (x, y) coordinate pair) and another, larger np.array b (100,000-200,000 coordinates). I need to remove, as quickly as possible, the elements of a that are not present in b, leaving only the elements of a that are also present in b. All coordinates are integers. For example:

a = np.array([[2,5],[6,3],[4,2],[1,4]])
b = np.array([[2,7],[4,2],[1,5],[6,3]])

Desired output:

a

>> [6,3],[4,2]

What is the fastest way of doing this for arrays of the size I mentioned?

I am OK with solutions that use any other packages or imports too (e.g., converting to a base Python list or set, using Pandas, etc.) besides those within Numpy.

  • since pandas is tagged, it's a merge: `pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()` – anky Apr 15 '21 at 16:51
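For reference, that merge one-liner applied to the example arrays above would look like this (a minimal sketch; `merge()` defaults to an inner join on the shared column names, so only rows present in both frames survive):

import numpy as np
import pandas as pd

a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])

# inner join on columns 0 and 1 keeps only the rows common to both
result = pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()
print(result)
# [[6 3]
#  [4 2]]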

2 Answers


This appears to depend a lot on the array size and "sparseness" (likely due to hash table magic).

The so_8317022 function below is the answer from Get intersecting rows across two 2D numpy arrays.

The takeaways seem to be (on my machine) that:

  • the Pandas approach has an edge with large sparse sets
  • set intersection is very, very fast with small array sizes (though admittedly it returns a set, not a numpy array – see the snippet after this list)
  • the structured-dtype Numpy answer (so_8317022) is competitive with the Pandas merge at small sizes, but in these runs it never catches set intersection.
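As noted above, the set approach returns a set of tuples rather than an array; converting back is a one-liner if needed (not included in the timings below), e.g.:

result = set_intersection(a, b)    # set of (x, y) tuples
result = np.array(list(result))    # back to an (n, 2) integer array

The benchmark code: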
from collections import defaultdict
import timeit

import numpy as np
import pandas as pd


def pandas_merge(a, b):
    # inner merge on the shared column names keeps only rows present in both
    return pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()


def set_intersection(a, b):
    # note: returns a set of (x, y) tuples, not a numpy array
    return set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))


def so_8317022(a, b):
    # view each (x, y) row as a single structured scalar so that
    # np.intersect1d can compare whole rows at once
    nrows, ncols = a.shape
    dtype = {
        "names": ["f{}".format(i) for i in range(ncols)],
        "formats": ncols * [a.dtype],
    }
    C = np.intersect1d(a.view(dtype), b.view(dtype))
    return C.view(a.dtype).reshape(-1, ncols)


def test_fn(f, a, b):
    # timeit's autorange() picks the repeat count; report calls per second
    number, time_taken = timeit.Timer(lambda: f(a, b)).autorange()
    return number / time_taken


def test(size, max_coord):
    # max_coord controls "sparseness": a smaller range of possible
    # coordinates means more duplicates and more matches
    a = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    b = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    return {fn.__name__: test_fn(fn, a, b) for fn in (pandas_merge, set_intersection, so_8317022)}


series = []
datas = defaultdict(list)

for size in (100, 1000, 10000, 100000):
    for max_coord in (50, 500, 5000):
        print(size, max_coord)
        series.append((size, max_coord))
        for fn, result in test(size, max_coord).items():
            datas[fn].append(result)

print("size", "sparseness", "func", "ops/sec")
for fn, values in datas.items():
    for (size, max_coord), value in zip(series, values):
        print(size, max_coord, fn, int(value))

The results on my machine are

size sparseness func ops/sec
100 50 pandas_merge 895
100 500 pandas_merge 777
100 5000 pandas_merge 708
1000 50 pandas_merge 740
1000 500 pandas_merge 751
1000 5000 pandas_merge 660
10000 50 pandas_merge 513
10000 500 pandas_merge 460
10000 5000 pandas_merge 436
100000 50 pandas_merge 11
100000 500 pandas_merge 61
100000 5000 pandas_merge 49
100 50 set_intersection 42281
100 500 set_intersection 44050
100 5000 set_intersection 43584
1000 50 set_intersection 3693
1000 500 set_intersection 3234
1000 5000 set_intersection 3900
10000 50 set_intersection 453
10000 500 set_intersection 287
10000 5000 set_intersection 300
100000 50 set_intersection 47
100000 500 set_intersection 13
100000 5000 set_intersection 13
100 50 so_8317022 8927
100 500 so_8317022 9736
100 5000 so_8317022 7843
1000 50 so_8317022 698
1000 500 so_8317022 746
1000 5000 so_8317022 765
10000 50 so_8317022 89
10000 500 so_8317022 48
10000 5000 so_8317022 57
100000 50 so_8317022 10
100000 500 so_8317022 3
100000 5000 so_8317022 3
– AKX

Not sure if this is the fastest way to do it, but if you turn the coordinates into a Pandas index you can use its intersection method. Since that runs low-level C code under the hood, the intersection step is probably pretty fast, though converting to a Pandas index may take some time.

import numpy as np
import pandas as pd

a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])

df_a = pd.DataFrame(a).set_index([0, 1])
df_b = pd.DataFrame(b).set_index([0, 1])
intersection = df_a.index.intersection(df_b.index)

The result looks like this:

print(intersection.values)
[(6, 3) (4, 2)]
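The index holds (x, y) tuples; if you need a regular (n, 2) array again, one way is, e.g.:

# convert the index of (x, y) tuples back to an (n, 2) integer array
result = np.array(intersection.tolist())
print(result)
# [[6 3]
#  [4 2]]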

EDIT:

Out of curiosity I made a comparison between the methods, now with a larger list of coordinates. I compared my first index-based method with a slightly improved version that does not require creating a DataFrame first but builds the index directly, and then with the DataFrame merge method proposed in the comments as well.

This is the code:

from random import randint, seed
import time
import numpy as np
import pandas as pd

seed(0)

n_tuple = 100000
i_min = 0
i_max = 10
a = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
b = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
np_a = np.array(a)
np_b = np.array(b)


def method0(a_array, b_array):
    # build an index via a throwaway DataFrame, then intersect
    index_a = pd.DataFrame(a_array).set_index([0, 1]).index
    index_b = pd.DataFrame(b_array).set_index([0, 1]).index
    return index_a.intersection(index_b).to_numpy()


def method1(a_array, b_array):
    # build the MultiIndex directly from the transposed coordinate columns
    index_a = pd.MultiIndex.from_arrays(a_array.T)
    index_b = pd.MultiIndex.from_arrays(b_array.T)
    return index_a.intersection(index_b).to_numpy()


def method2(a_array, b_array):
    # inner merge on both columns, as suggested in the question's comments
    df_a = pd.DataFrame(a_array)
    df_b = pd.DataFrame(b_array)
    return df_a.merge(df_b).to_numpy()


def method3(a_array, b_array):
    # plain Python sets of (x, y) tuples (returns a set, not an array)
    set_a = {(row[0], row[1]) for row in a_array}
    set_b = {(row[0], row[1]) for row in b_array}
    return set_a.intersection(set_b)


for cnt, intersect in enumerate([method0, method1, method2, method3]):
    t0 = time.time()
    if cnt < 3:
        intersection = intersect(np_a, np_b)
    else:
        # method3 gets the plain lists; the others get the numpy arrays
        intersection = intersect(a, b)
    print(f"method{cnt}: {time.time() - t0}")

The output looks like:

method0: 0.1439347267150879
method1: 0.14012742042541504
method2: 4.740894317626953
method3: 0.05933070182800293

Conclusion: the merge method on DataFrames (method2) is roughly 30 times slower than using intersections on the index (4.74 s vs 0.14 s here). The MultiIndex-based version (method1) is only slightly faster than method0 (my first proposal).

EDIT2: As proposed in the comment by @AKX: if you use plain lists and sets instead of numpy arrays (method3), you can gain another speed-up of roughly a factor of 2-3. But it is clear you should not use the merge method.

– Eelco van Vliet
    I'm pretty sure you don't need Pandas for this... Turning the data into a set of 2-tuples and using set intersection would also use C code. – AKX Apr 15 '21 at 17:07
  • Yep: using a pure set of tuples is the fastest method. However, if you use the numpy arrays as the input arguments it is almost as fast, so it really depends on whether you need numpy arrays or not. – Eelco van Vliet Apr 15 '21 at 18:07