Foreword
Your question is related to Nearest Neighbors Search (NNS).
One way to solve it is to build a spatial index, as spatial databases do.
A straightforward solution is a KD-Tree, which is implemented in sklearn.
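As a minimal sketch of what such an index does, the snippet below builds a KD-Tree on three points and asks for the nearest neighbour of a single query point. The coordinates and variable names are made up for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Three illustrative points to index (hypothetical data)
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
tree = KDTree(points)

# query returns (distances, indices), both of shape (n_queries, k)
dist, ind = tree.query(np.array([[0.9, 1.2]]), k=1)

nearest_index = int(ind[0, 0])        # row of `points` closest to the query
nearest_distance = float(dist[0, 0])  # Euclidean distance to that row
print(nearest_index, nearest_distance)
```

The tree is built once and can then be queried for many points at a time, which is what the rest of this answer relies on.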
Questions
At this point it is essential to know what question we want to answer:
Q1.a) Find all points in dataset B that lie within a given distance threshold atol (radius) of each point of dataset A.
Or:
Q2.a) Find the k closest points in dataset B with respect to each point of dataset A.
Both questions can be answered using a KD-Tree. What we must realise is:
- Questions Q1 and Q2 are different, and so are their answers;
- Q1 can map 0 or more points together, there is no guarantee of a one-to-one mapping;
- Q2 will map exactly 1 to k points: there is a guarantee that all points in the reference dataset are mapped to k points in the search dataset (provided there are enough points);
- Q2.a is generally not equivalent to its reciprocal question Q2.b (when datasets A and B are swapped).
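The last point can be checked on a toy case. The sketch below uses two hand-picked datasets of two points each (hypothetical values and an arbitrary radius of 1.5) and compares the pairs produced in both directions for the nearest-neighbour question and for the radius question.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical toy datasets, chosen so that the asymmetry is visible
A = np.array([[0.0, 0.0], [3.0, 0.0]])
B = np.array([[1.0, 0.0], [10.0, 0.0]])
tree_A, tree_B = KDTree(A), KDTree(B)

# Q2.a / Q2.b: nearest neighbour in each direction, stored as (index in A, index in B)
a_to_b = tree_B.query(A, k=1, return_distance=False)[:, 0]
b_to_a = tree_A.query(B, k=1, return_distance=False)[:, 0]
knn_ab = {(i, int(j)) for i, j in enumerate(a_to_b)}
knn_ba = {(int(i), j) for j, i in enumerate(b_to_a)}

# Q1.a / Q1.b: radius search in each direction, same pair convention
r = 1.5
rad_ab = {(i, int(j)) for i, nbrs in enumerate(tree_B.query_radius(A, r=r)) for j in nbrs}
rad_ba = {(int(i), j) for j, nbrs in enumerate(tree_A.query_radius(B, r=r)) for i in nbrs}

print(knn_ab == knn_ba)  # the nearest-neighbour pairs depend on the direction
print(rad_ab == rad_ba)  # the radius pairs do not
```

Here both points of A pick the same point of B as nearest neighbour, while the far point of B has to pick some point of A, so the two kNN mappings differ. The radius criterion only depends on the distance between the two points of a pair, which is why it is symmetric.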
MCVE
Let's build an MCVE to address both questions:
# Imports
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree
# Parameters
N = 50
atol = 50
keys = ['x', 'y']
# Trials Datasets (with different sizes, we keep it general):
df1 = pd.DataFrame(np.random.randint(0, 500, size=(N-5, 2)), columns=keys).reset_index()
df2 = pd.DataFrame(np.random.randint(0, 500, size=(N+5, 2)), columns=keys).reset_index()
# Spatial Index for Datasets:
kdt1 = KDTree(df1[keys].values, leaf_size=5, metric='euclidean')
kdt2 = KDTree(df2[keys].values, leaf_size=5, metric='euclidean')
# Answer Q2.a and Q2.b (searching for a single neighbour):
df1['kNN'] = kdt2.query(df1[keys].values, k=1, return_distance=False)[:,0]
df2['kNN'] = kdt1.query(df2[keys].values, k=1, return_distance=False)[:,0]
# Answer Q1.a and Q1.b (searching within a radius):
df1['radius'] = kdt2.query_radius(df1[keys].values, atol)
df2['radius'] = kdt1.query_radius(df2[keys].values, atol)
Below is the result for dataset A as reference:
index x y kNN radius
0 0 65 234 39 [39]
1 1 498 49 11 [11]
2 2 56 171 19 [29, 19]
3 3 239 43 20 [20]
4 4 347 32 50 [50]
[...]
At this point, we have everything required to spatially join our data.
Nearest Neighbors (k=1)
We can join our datasets using the kNN index:
kNN1 = df1.merge(df2[['index'] + keys], left_on='kNN', right_on='index', suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 [39] 39 49 260
1 1 498 49 11 [11] 11 487 4
2 2 56 171 19 [29, 19] 19 39 186
3 3 239 43 20 [20] 20 195 33
4 4 347 32 50 [50] 50 382 32
[...]
Graphically it leads to:

And the reciprocal question gives:

We see that the mapping is exactly 1-to-k=1: all points in the reference dataset are mapped to another point in the search dataset. But the answers differ when we swap the reference.
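Since a nearest neighbour always exists, it can be arbitrarily far away. A possible refinement, sketched here under assumptions, is to keep the distance returned by query next to the neighbour index and filter on it afterwards. The two small frames and the max_distance threshold are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.neighbors import KDTree

keys = ['x', 'y']
# Hypothetical reference and search datasets
ref = pd.DataFrame({'x': [0, 10, 100], 'y': [0, 10, 100]})
search = pd.DataFrame({'x': [1, 12, 0], 'y': [0, 10, 3]})

tree = KDTree(search[keys].values, metric='euclidean')

# With return_distance left at its default (True), query returns (dist, ind)
dist, ind = tree.query(ref[keys].values, k=1)
ref['kNN'] = ind[:, 0]
ref['distance'] = dist[:, 0]

# Keep only the matches closer than an assumed threshold
max_distance = 5.0
close = ref[ref['distance'] <= max_distance]
print(close)
```

The third reference point still gets a neighbour, but its distance is large, so it is dropped by the filter.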
Radius atol
We can also join our datasets using the radius index:
rad1 = df1.explode('radius')\
          .merge(df2[['index'] + keys], left_on='radius', right_on='index',
                 suffixes=('_a', '_b'))
It returns:
index_a x_a y_a kNN radius index_b x_b y_b
0 0 65 234 39 39 39 49 260
2 1 498 49 11 11 11 487 4
3 2 56 171 19 29 29 86 167
4 2 56 171 19 19 19 39 186
7 3 239 43 20 20 20 195 33
[...]
Graphically:

The reciprocal answer is equivalent:

We see that the answers are identical, but there is no guarantee of a one-to-one mapping. Some points are not mapped (lonely points), some are mapped to many points (dense neighbourhood). Additionally, it requires an extra parameter, atol, which must be tuned for a given context.
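One way to get a feeling for a given atol is to count the neighbours each point receives. The sketch below uses the count_only option of query_radius on hypothetical toy data with an arbitrary radius of 2. It is only meant to show how lonely points and dense neighbourhoods can be spotted.

```python
import numpy as np
from sklearn.neighbors import KDTree

# Hypothetical toy datasets
A = np.array([[0.0, 0.0], [5.0, 5.0], [100.0, 100.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 6.0]])

tree_B = KDTree(B)
atol = 2.0  # assumed radius, to be tuned for the real data

# count_only=True returns the number of neighbours instead of their indices
counts = tree_B.query_radius(A, r=atol, count_only=True)

lonely = np.flatnonzero(counts == 0)  # points of A with no match in B
dense = np.flatnonzero(counts > 1)    # points of A matched to several points of B
print(counts, lonely, dense)
```

Running this for a few candidate radii shows how the share of unmatched and multiply matched points changes with atol.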
Bonus
Below is the function used to render the figures:
import matplotlib.pyplot as plt

def plot(A, B, join, title=''):
    # Segment end points: one row per matched pair (point of A, point of B)
    X = join.loc[:,['x_a','x_b']].values
    Y = join.loc[:,['y_a','y_b']].values
    fig, axe = plt.subplots()
    axe.plot(A['x'], A['y'], 'x', label='Dataset A')
    axe.plot(B['x'], B['y'], 'x', label='Dataset B')
    # Draw one line per matched pair
    for k in range(X.shape[0]):
        axe.plot(X[k,:], Y[k,:], linewidth=0.75, color='black')
    axe.set_title(title)
    axe.set_xlabel(r'$x$')
    axe.set_ylabel(r'$y$')
    axe.grid()
    axe.legend(bbox_to_anchor=(1,1), loc='upper left')
    return axe
References
Some useful references: