Defining a isclose
function similar to numpy.isclose
, but a bit faster (mostly due to not checking any input and not supporting both relative and absolute tolerance):
import numpy as np
array1 = np.array([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]])
def isclose(x, y, atol):
return np.abs(x - y) < atol
Now comes the hard part. We need to calculate if any two values are close within the inner most dimension. For this I reshape the arrays in such a way that the first array has its values along the second dimension, replicated across the first and the second array has its values along the first dimension, replicated along the second (note the 1, 3
and 3, 1
):
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]:
array([[[ True, True],
[False, False],
[False, False]],
[[False, False],
[False, False],
[False, False]],
[[False, False],
[ True, True],
[False, False]]], dtype=bool)
Now we want all entries where the value is close to any other value (along the same dimension):
In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]:
array([[ True, True],
[ True, True],
[False, False]], dtype=bool)
Then we want only those where both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True, True, False], dtype=bool)
And finally, we can use this to index array1
:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]:
array([[[ 1.22, 5.64]],
[[ 2.31, 7.63]]])
If you want to, you can swap the any
and all
calls. One might be faster than the other in your case.
The 3
in the reshape
calls needs to be substituted for the actual length of your data.
This algorithm will have the same bad runtime of the other answer using itertools.product
, but at least the actual looping is done implicitly by numpy
and is implemented in C. This is visible in the timings:
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Where the pares
function is the code defined by @Ferran Parés in another answer and the arrays as already reshaped there.
And for larger arrays it becomes more obvious:
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array1_pares = array1.reshape(1000, 2)
array2_pares = arra2.reshape(1000, 2)
In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes it almost to 100%. It also takes about 12s:
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)