
Goal

I am writing a "colocalization" script to identify unique co-localized pairs of coordinates between two sets of data. My data is fairly large, with up to 100k points in each set, so performance is important.

For example, I have two sets of points:

import numpy as np
points_a = np.array([[1, 1],[2, 2],[3, 3],[6, 6]])
points_b = np.array([[1, 1],[2, 3],[3, 5],[6, 6], [7,6]]) # may be longer than points_a

For each point in points_a I want to find the nearest point in points_b. However, I don't want any point in points_b to be used in more than one pair. I can easily find the nearest neighbors using NearestNeighbors or a similar routine:

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.kneighbors(points_a)
print(indices.ravel())
>>> [0 1 1 3]

As above, this can give me a solution where a point in points_b is used twice. I would instead like to find the solution where each point in points_b is used at most once while minimizing the total distance across all pairs. In the above case:

[0, 1, 2, 3]
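
For a toy example like this the optimum is easy to verify with a dense assignment solver (just a sanity check using the points_a / points_b defined above; a dense cost matrix obviously will not scale to ~100k points per set):

from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# dense pairwise distance matrix: fine for a handful of points,
# far too large for ~100k x ~100k
cost = cdist(points_a, points_b)
row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)                       # [0 1 2 3]
print(cost[row_ind, col_ind].sum())  # 3.0, the total distance of the optimal pairing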

I figure a start would be to use NearestNeighbors or similar to find nearest neighbor candidates:

from sklearn.neighbors import NearestNeighbors

max_search_radius = 3
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.radius_neighbors(points_a, radius=max_search_radius)
print(distances)
print(indices)
>>> [[0, 2.24], [1.41, 1], [2.83, 1, 2], [0, 1]]
>>> [[0, 1], [0, 1], [0, 1, 2], [3, 4]]

This shrinks the search space, but I am unclear how to then compute the global optimum from the candidate lists. I stumbled across this post: Find optimal unique neighbour pairs based on closest distance, but that solution is for a single set of points and I don't see how to translate the method to my case.

Any advice would be greatly appreciated!

Update

Hey all. With everyone's advice I found a somewhat working solution:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix, csgraph


def colocalize_points(points_a: np.ndarray, points_b: np.ndarray, r: float):
    """Find pairs that minimize global distance. Filters out anything outside radius `r`."""

    # candidate pairs: for each point in A, every point in B within radius r
    neigh = NearestNeighbors(n_neighbors=1)
    neigh.fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)

    # flatten and build the matching A (row) index for every candidate pair;
    # the +1 shifts all weights so that genuine zero distances are not
    # dropped as implicit zeros of the sparse matrix
    d_flat = np.hstack(distances) + 1
    b_flat = np.hstack(b_indices)
    a_flat = np.repeat(np.arange(len(points_a)), [len(n) for n in b_indices])

    # sparse biadjacency matrix of candidate pairs (rows = A, columns = B)
    sm = csr_matrix((d_flat, (a_flat, b_flat)), shape=(len(points_a), len(points_b)))

    # drop A points that cannot be matched at all, otherwise the full
    # matching below fails with "no full matching exists"
    a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
    sm_filtered = sm[a_matchable != -1]

    # distance-minimizing matching; note that row_match indexes rows of the
    # filtered matrix, not of the original points_a
    row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm_filtered)
    return row_match, col_match
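
For reference, a quick check on the toy points from the top of the question (here every point in A has an in-range candidate, so no rows are filtered and the returned row indices line up with points_a):

points_a = np.array([[1, 1], [2, 2], [3, 3], [6, 6]])
points_b = np.array([[1, 1], [2, 3], [3, 5], [6, 6], [7, 6]])

row_match, col_match = colocalize_points(points_a, points_b, r=3)
print(row_match, col_match)
# should give [0 1 2 3] [0 1 2 3], i.e. the distance-minimizing pairing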

The only issue I have is that by filtering the matrix with maximum_bipartite_matching I cannot be sure I truly have the best result, since it just keeps the first match it finds. For example, if I have two points in A, [[2, 2], [3, 3]], whose only candidate match in B is [3, 3], maximum_bipartite_matching will keep whichever appears first. So if [2, 2] appears first in the matrix, the A point [3, 3] will be dropped despite being the better match.
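
Here is a minimal sketch of that situation (a hypothetical 2x1 candidate matrix, with the distances shifted by +1 as in the function above). The stored values only mark which entries are edges, so the cardinality-based matcher has no way to prefer the closer point:

import numpy as np
from scipy.sparse import csr_matrix, csgraph

# rows = A points [2, 2] and [3, 3]; the single column is their shared
# candidate [3, 3] in B, with distances 1.41 and 0 shifted by +1
sm = csr_matrix(np.array([[2.41], [1.0]]))
print(csgraph.maximum_bipartite_matching(sm, perm_type='column'))
# only one of the two rows can be matched (the other comes back as -1),
# and the choice is made from the sparsity pattern alone, not from the
# stored distances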

Update 1

To address the comments below, here is why I believe maximum_bipartite_matching does not give me the desired solution. Consider the points:

points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])

The optimal a -> b pairing that minimizes total distance is:

(1, 1) -> (1, 1)
(2, 2) -> (2, 2)
(3, 3) -> (2, 3)

However, if I run the following:

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, b_indices = neigh.radius_neighbors(points_a, radius=3)

# flatten and get indices for A. This will also drop points in A with no matches in range
d_flat = np.hstack(distances) + 1
b_flat = np.hstack(b_indices)
a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])

# filter out A points that cannot be matched
sm = csr_matrix((d_flat, (a_flat, b_flat)))
a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
for i, b in enumerate(a_matchable):
    print(points_a[i], '->', points_b[b])

I get this pairing instead:

[1 1] -> [1 1]
[2 2] -> [2 2]
[3 3] -> [3 5]

Swapping the last two points in points_b will give me the expected solution. This indicates to me that the algorithm is not taking the distance (weight) into account and instead just tries to maximize the number of connections. I could very well have made a mistake though so please let me know.
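
For contrast, handing the same sparse matrix `sm` from the snippet above to the weight-aware matcher does recover the expected pairing:

rows, cols = csgraph.min_weight_full_bipartite_matching(sm)
for i, j in zip(rows, cols):
    print(points_a[i], '->', points_b[j])
# expected:
# [1 1] -> [1 1]
# [2 2] -> [2 2]
# [3 3] -> [2 3]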

Ricardo
  • Are you trying to find the boundaries of the clusters? – Golden Lion Feb 08 '22 at 03:17
  • No, for every point in points_a I am trying to find the nearest neighbor in points_b. These are spatial positions of objects that have other properties besides location. I need to find the nearest neighbor for each point to analyze how those other properties correlate between neighbors. – Ricardo Feb 08 '22 at 03:25
  • This is like an **assignment problem** (a slight variant). Your problem is rather large, but assignment problems can be solved quite efficiently. – Erwin Kalvelagen Feb 08 '22 at 10:13
  • Ah, thank you. That is definitely the correct field. Something like `scipy.optimize.linear_sum_assignment` might do the trick, but creating a cost matrix for 100,000 x 100,000 points would be too large for memory. If I filter for nearest neighbor candidates, most of the matrix will be 0s, making a lot of the matrix unnecessary. Are there any forms that don't require generating a matrix? – Ricardo Feb 08 '22 at 18:17
  • If you can filter the candidates, you can use a sparse matrix and use `scipy.sparse.csgraph.min_weight_full_bipartite_matching`. https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csgraph.min_weight_full_bipartite_matching.html#scipy.sparse.csgraph.min_weight_full_bipartite_matching – Juan Feb 08 '22 at 22:47
  • Thanks. That is definitely the way to go. Unfortunately, for my real data I get the error "no full matching exists", which is not a surprise. Is it possible to use `scipy.sparse.csgraph.maximum_bipartite_matching` to do extra filtering prior to `scipy.sparse.csgraph.min_weight_full_bipartite_matching`? – Ricardo Feb 11 '22 at 01:37
  • *maximum_bipartite_matching will keep whichever appears first*. No, not at all. It will choose the best. – Erwin Kalvelagen Feb 13 '22 at 11:39
  • That didn't seem to be the case in my testing but I very well could have made a mistake. How do you know that to be the case? Is best the minimum weight? I didn't see anything in the docs. – Ricardo Feb 14 '22 at 19:46
  • That is what maximum bipartite matching is about. It is not a heuristic. – Erwin Kalvelagen Feb 17 '22 at 02:55
  • I suggest creating a [Minimal Reproducing Example](https://stackoverflow.com/help/minimal-reproducible-example) that demonstrates the problem. That makes it easier to see what is going on. – Erwin Kalvelagen Feb 17 '22 at 18:04
