Intersection between two multi-dimensional arrays with tolerance - NumPy / Python

Question

I am stuck on a problem. I have two 2-D numpy arrays, filled with x and y coordinates. Those arrays might look like:

array1([[(1.22, 5.64)],
   [(2.31, 7.63)],
   [(4.94, 4.15)]],

array2([[(1.23, 5.63)],
   [(6.31, 10.63)],
   [(2.32, 7.65)]],

Now I have to find "duplicate nodes". However, I also have to treat nodes as equal when their coordinates agree within a given tolerance, so I can't use solutions like this. Since my arrays are quite big (~200,000 rows each), two simple for loops are not an option either. My final output should look like this:

output([[(1.23, 5.63)],
   [(2.32, 7.65)]],

I would appreciate some hints.

Cheers,

definitely try using the pandas library. It's meant for large data sets and has a built in intersection function. — Yserbius, Mar 15 '18 at 15:44
maybe you can approximate your result by rounding your decimals `np.around(array1, 1)` or `ceil` values `np.ceil(array1)` — J. Doe, Mar 15 '18 at 15:55
First of all, sorry for the late response and thank you for all the helpful approaches. Unfortunately, I couldn't use any of them without modifying my initial problem. Some suggestions were too time-consuming, whereas others were too memory-consuming. Nevertheless, I marked as useful all answers which I tried and which generally worked for the proposed issue. — , Mar 19 '18 at 14:38
@SebastianG So, how did you solve it finally for your case? Did you find something that's better than all of the listed solutions? If so, could you share? — Divakar, Mar 20 '18 at 16:18

score 4 · Answer 1 · answered Mar 15 '18 at 16:26

To compare two nodes with a given tolerance, I recommend using numpy.isclose(), where you can set a relative and an absolute tolerance.

numpy.isclose(1.24, 1.25, atol=1e-1)
# True
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# array([ True,  True])
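As a rough illustration of how the two tolerances combine, here is a minimal sketch that reproduces the documented test `|a - b| <= atol + rtol * |b|` by hand. The helper name `isclose_manual` and the sample values are made up for this example; they are not part of the answer above.

```python
import numpy as np

def isclose_manual(a, b, rtol=1e-05, atol=1e-08):
    # Hypothetical helper: the same inequality np.isclose documents,
    # |a - b| <= atol + rtol * |b|, without any NaN/inf handling.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.abs(a - b) <= atol + rtol * np.abs(b)

values_a = np.array([1.24, 2.31, 100.0])
values_b = np.array([1.25, 2.32, 100.5])

manual = isclose_manual(values_a, values_b, atol=1e-1)
builtin = np.isclose(values_a, values_b, atol=1e-1)
print(manual, builtin)
```

Note that the default rtol is not zero, so the effective tolerance grows slightly with the magnitude of the second argument. Passing rtol=0 gives a purely absolute comparison.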

Instead of writing two nested for loops, you can use the itertools.product() function to go through all pairs. The following code does what you want:

import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],
                   [2.31, 7.63],
                   [4.94, 4.15]])

array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])

output = np.empty((0,2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
        output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23  5.63]
#           [ 2.32  7.64]]

Graipher · Answer 2 · 2018-03-15T17:03:35.003

Defining an isclose function similar to numpy.isclose, but a bit faster (mostly because it does not validate its input and supports only an absolute tolerance):

import numpy as np

array1 = np.array([[(1.22, 5.64)],
                   [(2.31, 7.63)],
                   [(4.94, 4.15)]])

array2 = np.array([[(1.23, 5.63)],
                    [(6.31, 10.63)],
                    [(2.32, 7.65)]])

def isclose(x, y, atol):
    return np.abs(x - y) < atol

Now comes the hard part. We need to determine whether any two values are close along the innermost dimension. For this I reshape the arrays so that they broadcast against each other: the first array has its points along the second dimension (replicated across the first), and the second array has its points along the first dimension (replicated across the second). Note the 1, 3 and 3, 1:

In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]: 
array([[[ True,  True],
        [False, False],
        [False, False]],

       [[False, False],
        [False, False],
        [False, False]],

       [[False, False],
        [ True,  True],
        [False, False]]], dtype=bool)

Now we want all entries where the value is close to any other value (along the same dimension):

In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]: 
array([[ True,  True],
       [ True,  True],
       [False, False]], dtype=bool)

Then we want only those where both values of the tuple are close:

In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True,  True, False], dtype=bool)

And finally, we can use this to index array1:

In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]: 
array([[[ 1.22,  5.64]],

       [[ 2.31,  7.63]]])

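If you want to, you can swap the any and all calls. One might be faster than the other in your case. Note that the two orders are not strictly equivalent: .any(axis=0).all(axis=-1) tests each coordinate separately, so x may be close to one point of array2 and y to another, whereas .all(axis=-1).any(axis=0) requires a single point of array2 to be close in both coordinates.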

The 3 in the reshape calls needs to be replaced with the actual length of your data.
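To avoid hard-coding the length, the same broadcasting idea can be wrapped in a function. This is only a sketch: the name `close_rows_bruteforce` is made up here, and it uses the all-then-any order so that both coordinates have to match the same point of the second array.

```python
import numpy as np

def close_rows_bruteforce(a, b, atol):
    """Rows of a that have some row of b within atol in both coordinates."""
    # Accept both (N, 2) and (N, 1, 2) inputs.
    a = np.asarray(a, dtype=float).reshape(-1, 2)
    b = np.asarray(b, dtype=float).reshape(-1, 2)
    # diff[i, j, :] = |b[i] - a[j]|, shape (len(b), len(a), 2)
    diff = np.abs(b[:, None, :] - a[None, :, :])
    # all(axis=-1): x and y are both close for the same pair (i, j)
    # any(axis=0):  at least one b[i] is close to a[j]
    mask = (diff < atol).all(axis=-1).any(axis=0)
    return a[mask]

array1 = np.array([[(1.22, 5.64)], [(2.31, 7.63)], [(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)], [(6.31, 10.63)], [(2.32, 7.65)]])
result = close_rows_bruteforce(array1, array2, 0.03)
print(result)

# x matches one point and y matches another: not a real duplicate.
mixed = close_rows_bruteforce([[0.0, 0.0]], [[0.0, 5.0], [5.0, 0.0]], 0.1)
print(mixed.shape)
```

The memory use is unchanged: the intermediate array still has len(a) * len(b) * 2 entries.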

This algorithm has the same quadratic runtime as the other answer that uses itertools.product (every point of one array is compared with every point of the other), but at least the actual looping is done implicitly by numpy and is implemented in C. This is visible in the timings:

In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here, pares is the code from @Ferran Parés's answer wrapped in a function, with the arrays already reshaped as in that answer.

And for larger arrays it becomes more obvious:

array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))

array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)

In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes memory usage to almost 100%. It also takes about 12s:

In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
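One way to keep the memory bounded is to process the second array in blocks, so that only a slice of the full comparison array exists at any time. This is a sketch under assumptions: the function name `close_rows_chunked` and the `chunk` parameter are invented for this example, and the runtime stays quadratic.

```python
import numpy as np

def close_rows_chunked(a, b, atol, chunk=64):
    """Rows of a with some row of b within atol in both coordinates.

    Only chunk * len(a) * 2 differences are held in memory at once.
    """
    a = np.asarray(a, dtype=float).reshape(-1, 2)
    b = np.asarray(b, dtype=float).reshape(-1, 2)
    mask = np.zeros(len(a), dtype=bool)
    for start in range(0, len(b), chunk):
        block = b[start:start + chunk]
        # diff[i, j, :] = |block[i] - a[j]|
        diff = np.abs(block[:, None, :] - a[None, :, :])
        # Accumulate matches found in this block.
        mask |= (diff < atol).all(axis=-1).any(axis=0)
    return a[mask]

array1 = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
array2 = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
print(close_rows_chunked(array1, array2, 0.03, chunk=2))
```

With 200,000 points per array this still performs 4e10 comparisons, so it trades memory for patience rather than fixing the underlying complexity.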

So, we would need 1600GB RAM for 200,000 pts, right? Guess we need to wait for the future to arrive. — Divakar, Mar 16 '18 at 05:57
@Divakar Well either that or use a better algorithm (like in your answer). — Graipher, Mar 16 '18 at 06:43

Divakar · Answer 3 · 2018-03-15T17:52:45.247

There are many possible ways to define that tolerance. Since we are talking about XY coordinates, the tolerance is most probably a Euclidean distance. So we can use the Cython-powered kd-tree from SciPy for quick nearest-neighbor lookup, which is efficient in both memory and runtime. The implementation would look something like this -

from scipy.spatial import cKDTree

# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0] # k=1 selects closest one neighbor

    # Check the distances against the given tolerance value and
    # keep only the matching rows of b for the final output
    return b[dist <= tol]

Sample step-by-step run -

# Input 2D arrays
In [68]: a
Out[68]: 
array([[1.22, 5.64],
       [2.31, 7.63],
       [4.94, 4.15]])

In [69]: b
Out[69]: 
array([[ 1.23,  5.63],
       [ 6.31, 10.63],
       [ 2.32,  7.65]])

# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]

In [71]: dist
Out[71]: array([0.01414214, 5.        , 0.02236068])

# Mask of distances within the given tolerance
In [72]: tol = 1

In [73]: dist <= tol
Out[73]: array([ True, False,  True])

# Finally, select the matching rows of b
In [74]: b[dist <= tol]
Out[74]: 
array([[1.23, 5.63],
       [2.32, 7.65]])

Timings on 200,000 pts -

In [20]: N = 200000
    ...: np.random.seed(0)
    ...: a = np.random.rand(N,2)
    ...: b = np.random.rand(N,2)

In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop
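If the tolerance is meant per coordinate (as in the question) rather than as a Euclidean radius, the same tree can be queried with the Chebyshev metric. This is a sketch, assuming that reading of the tolerance: `intersect_close_per_axis` is a made-up name, and `p=np.inf` makes the reported distance `max(|dx|, |dy|)`.

```python
import numpy as np
from scipy.spatial import cKDTree

def intersect_close_per_axis(a, b, tol):
    """Rows of b whose nearest row of a differs by at most tol in x and in y."""
    a = np.asarray(a, dtype=float).reshape(-1, 2)
    b = np.asarray(b, dtype=float).reshape(-1, 2)
    # p=np.inf selects the Chebyshev (maximum-coordinate) distance.
    dist, _ = cKDTree(a).query(b, k=1, p=np.inf)
    return b[dist <= tol]

a = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
b = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
print(intersect_close_per_axis(a, b, tol=0.03))
```

On the sample data this keeps the first and third rows of b, the same rows as the Euclidean version with tol = 1.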

hpaulj · Answer 4 · 2018-03-15T16:48:18.267

As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.

And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.

But you might also want to keep in mind what intersect1d does:

if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]

unique has been enhanced to handle rows (axis parameters), but intersect has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.

Notice that intersect concatenates the unique arrays, sorts, and again finds the duplicates.

I know you didn't want a loop version, but to promote conceptualization of the problem here's one anyways:

In [581]: a = np.array([(1.22, 5.64),
     ...:    (2.31, 7.63),
     ...:    (4.94, 4.15)])
     ...: 
     ...: b = np.array([(1.23, 5.63),
     ...:    (6.31, 10.63),
     ...:    (2.32, 7.65)])
     ...:

I removed a layer of nesting in your arrays.

In [582]: c = []
In [583]: for a1 in a:
     ...:     for b1 in b:
     ...:         if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))

or as a list comprehension:

In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]: 
[(array([1.22, 5.64]), array([1.23, 5.63])),
 (array([2.31, 7.63]), array([2.32, 7.65]))]

complex approximation

In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]: 
array([[12, 56],
       [23, 76],
       [49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
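One caveat with scaling and truncating, shown as a small sketch: two values that differ by less than the bin width can still fall into different bins, and then the intersection misses them. The helper `to_complex_bins` and the sample points are invented for this illustration.

```python
import numpy as np

def to_complex_bins(points, scale=10):
    # Scale, truncate to integers, and pack (x, y) into one complex number.
    bins = (np.asarray(points, dtype=float) * scale).astype(int)
    return bins[:, 0] + 1j * bins[:, 1]

# The x values differ by only 0.02, but they straddle the 1.2 bin edge.
p = np.array([[1.19, 5.64]])
q = np.array([[1.21, 5.64]])

pc = to_complex_bins(p)   # x falls in bin 11
qc = to_complex_bins(q)   # x falls in bin 12
common = np.intersect1d(pc, qc)
print(pc, qc, common)
```

So the rounded intersection is an approximation: it can miss close pairs near bin edges, and it can also report pairs that are almost a full bin width apart.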

intersect inspired

Concatenate the arrays, sort them, take difference, and find the small differences:

In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]: 
array([[ 4.94,  4.15],
       [ 1.23,  5.63],
       [ 1.22,  5.64],
       [ 2.31,  7.63],
       [ 2.32,  7.65],
       [ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]: 
array([[-3.71,  1.48],
       [-0.01,  0.01],
       [ 1.09,  1.99],
       [ 0.01,  0.02],
       [ 3.99,  2.98]])

In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1)  # refine with abs
Out[623]: array([False,  True, False,  True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab1[_]
Out[627]: 
array([[1.23, 5.63],
       [2.31, 7.63]])

DIY-DS · Answer 5 · 2018-03-15T18:33:54.473

Maybe you could try this, using pure NumPy and a self-defined function:

import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)

print(xDA)
print(yDA)

#Match x to y
def np_matrix(myx,myy,calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)

    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches


alwd_dist=0.1

xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)

shared_items = set(m1.items()) & set(m0.items())
if not shared_items:
    print("No Matched Items based on given allowed distance:",alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]],yDA[ke[1]])
