Goal
I am writing a "colocalization" script to identify unique co-localized pairs of coordinates between two sets of data. My data is quite large, with up to 100k points in each set, so performance is important.
For example, I have two sets of points:
import numpy as np
points_a = np.array([[1, 1],[2, 2],[3, 3],[6, 6]])
points_b = np.array([[1, 1],[2, 3],[3, 5],[6, 6], [7,6]]) # may be longer than points_a
For each point in points_a I want to find the nearest point in points_b. However, I don't want any point in points_b used in more than one pair. I can easily find the nearest neighbors using NearestNeighbors or one of the similar routines:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.kneighbors(points_a)
print(indices.ravel())
>>> [0 1 1 3]
As above, this can give me a solution where a point in points_b is used twice. I would instead like to find the solution where each point in points_b is used at most once while minimizing the total distance across all pairs. In the above case:
[0, 1, 2, 3]
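On a toy example this is just the (rectangular) assignment problem, so I can verify the target output with the dense solver (a sketch using scipy.optimize.linear_sum_assignment and scipy.spatial.distance.cdist; a full 100k x 100k distance matrix is obviously not an option, so this is only for illustration):
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
# dense cost matrix: cost[i, j] = euclidean distance between points_a[i] and points_b[j]
cost = cdist(points_a, points_b)
# assigns each row of points_a a unique column of points_b, minimizing total cost
row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)
>>> [0 1 2 3]
print(cost[row_ind, col_ind].sum())
>>> 3.0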
I figure a start would be to use NearestNeighbors or similar to find nearest neighbor candidates:
from sklearn.neighbors import NearestNeighbors
max_search_radius = 3
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.radius_neighbors(points_a, max_search_radius)
print(distances)
print(indices)
>>> [[0, 2.24], [1.41, 1], [2.83, 1, 2], [0, 1]]
>>> [[0, 1], [0, 1], [0, 1, 2], [3, 4]]
This narrows down the candidate search considerably, but I am unclear how I can then compute the global optimum. I stumbled across this post: Find optimal unique neighbour pairs based on closest distance, but the solution there is for a single set of points and I am unclear how I could translate the method to my case.
Any advice would be greatly appreciated!
Update
Hey all. With everyone's advice I found a somewhat working solution:
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix, csgraph

def colocalize_points(points_a: np.ndarray, points_b: np.ndarray, r: float):
    """Find pairs that minimize the total distance. Filters out anything outside radius `r`."""
    neigh = NearestNeighbors(n_neighbors=1)
    neigh.fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)
    # Flatten the ragged per-point results and build matching row indices for A.
    # Distances are shifted by +1 so an exact match (distance 0) is not stored
    # as an implicit zero, which the sparse matrix would treat as "no edge".
    d_flat = np.hstack(distances) + 1
    b_flat = np.hstack(b_indices)
    a_flat = np.repeat(np.arange(len(points_a)), [len(n) for n in b_indices])
    # Build the sparse biadjacency matrix; passing shape explicitly keeps
    # trailing points with no neighbors from shrinking the matrix.
    sm = csr_matrix((d_flat, (a_flat, b_flat)), shape=(len(points_a), len(points_b)))
    # filter out A points that cannot be matched at all
    a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
    sm_filtered = sm[a_matchable != -1]
    # Now run the distance-minimizing matching. The returned row indices refer
    # to rows of `sm_filtered`, i.e. to the matchable subset of points_a.
    row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm_filtered)
    return row_match, col_match
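Running this on the original example (SciPy >= 1.6 is needed for min_weight_full_bipartite_matching; note the returned row indices refer to the filtered matrix, but here nothing gets dropped, so they line up with points_a directly):
row_match, col_match = colocalize_points(points_a, points_b, r=3)
print(row_match, col_match)
>>> [0 1 2 3] [0 1 2 3]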
Only issue I have is that by filtering the matrix with maximum_bipartite_matching, I cannot be sure I truly have the best result, since it just returns the first maximum matching it finds. For example, if I have two points in A, [[2, 2], [3, 3]], whose only candidate match in B is [3, 3], maximum_bipartite_matching will keep whichever appears first. So if [2, 2] appears first in the matrix, [3, 3] will be dropped despite being the better match.
Update 1
To address the comments below, here is my reasoning for why maximum_bipartite_matching does not give me the desired solution. Consider these points:
points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])
The optimal (a, b) pairing that minimizes the total distance is:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (2, 3)]
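(A quick sanity check with the dense solver agrees: the total distance is 0 + 0 + 1 = 1, versus 2 if (3, 3) is paired with (3, 5).)
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
# column 3 of points_b is (2, 3), so (3, 3) pairs with (2, 3) as claimed
row_ind, col_ind = linear_sum_assignment(cdist(points_a, points_b))
print(col_ind)
>>> [0 1 3]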
However, if I run the following:
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, b_indices = neigh.radius_neighbors(points_a, radius=3)
# flatten and build row indices for A, exactly as in colocalize_points
d_flat = np.hstack(distances) + 1
b_flat = np.hstack(b_indices)
a_flat = np.repeat(np.arange(len(points_a)), [len(n) for n in b_indices])
sm = csr_matrix((d_flat, (a_flat, b_flat)))
# a_matchable[i] is the column in B matched to row i of A (-1 if unmatched)
a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
print([(points_a[i], points_b[j]) for i, j in enumerate(a_matchable)])
I get the solution:
[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (3, 5)]
Swapping the last two points in points_b gives me the expected solution. This indicates to me that the algorithm is not taking the distance (weight) into account and instead just tries to maximize the number of connections. I could very well have made a mistake though, so please let me know.
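For comparison, feeding the same sparse matrix straight to the weighted solver (skipping the maximum_bipartite_matching pre-filter) does return the pairing I expect here. That only works because every point in A has at least one candidate within the radius; min_weight_full_bipartite_matching raises an error when no full matching exists, which is why I added the filter in the first place:
row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm)
for i, j in zip(row_match, col_match):
    print(points_a[i], points_b[j])
>>> [1 1] [1 1]
>>> [2 2] [2 2]
>>> [3 3] [2 3]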