
General problem

First let's explain the problem more generally. I have a collection of points with x,y coordinates and want to find the optimal unique neighbour pairs such that the summed distance over all pairs is minimised, with each point used in at most one pair.

Some simple examples

Note: points are not ordered and the x and y coordinates will both vary between 0 and 1000, but for simplicity in the examples below x==y and items are ordered.

First, let's say I have the following matrix of points:

matrix1 = np.array([[1, 1],[2, 2],[5, 5],[6, 6]])

For this dataset, the output should be [0,0,1,1], as points 1 and 2 are closest to each other and points 3 and 4 are closest to each other, giving pairs 0 and 1.

Second, two points cannot have the same partner. If we have the matrix:

matrix2 = np.array([[1, 1],[2, 2],[4, 4],[6, 6]])

Here pt1 and pt3 both have pt2 as their nearest neighbour, but pt1 is closer, so the output should again be [0,0,1,1].

Third, if we have the matrix:

matrix3 = np.array([[1, 1],[2, 2],[3, 3],[4, 4]])

Again pt1 and pt3 are closest to pt2, but this time at exactly the same distance. The output should still be [0,0,1,1], as pt4 is closest to pt3.

Fourth, in the case of an uneven number of points, the leftover (furthest) point should be made nan, e.g.

matrix4 = np.array([[1, 1],[2, 2],[4, 4]])

should give output [0,0,nan]

Fifth, if there are three or more points at exactly the same distance, the pairing can be random, e.g.

matrix5 = np.array([[1, 1],[2, 2],[3, 3]])

both an output of `[0,0,nan]` and `[nan,0,0]` should be fine.

My effort

Using sklearn:

import numpy as np
from sklearn.neighbors import NearestNeighbors
data = matrix3
nbrs = NearestNeighbors(n_neighbors=len(data), algorithm="ball_tree")
nbrs = nbrs.fit(data)
distances, indices = nbrs.kneighbors(data)

This outputs the indices:

array([[0, 1, 2, 3],
       [1, 2, 0, 3],
       [2, 1, 3, 0],
       [3, 2, 1, 0]])

The second column provides each point's nearest neighbour:

nearinds = indices[:, 1]

Next, in case there are duplicates in that list, we need to compare the distances involved:

if len(set(nearinds)) != len(nearinds):
    dupvals = [i for i in set(nearinds) if list(nearinds).count(i) > 1]
    for dupval in dupvals:
        dupinds = [i for i, j in enumerate(nearinds) if j == dupval]
        dupdists = distances[dupinds, 1]

Using these dupdists I would be able to find that one point is closer to the contested point than the other:

        if len(set(dupdists)) == len(dupdists):
            duppriority = np.argsort(dupdists)

Using the duppriority values we can give the closer point its right pairing. Giving the other point its pairing will then depend on its second-nearest pairing and on the distances of all other points to that same point. Furthermore, if both points are at the same distance to their closest point, I would also need to go one layer deeper:

        if len(set(dupdists)) != len(dupdists):
            dupdists2 = [distances[i, 2] for i, j in enumerate(nearinds) if j == dupval]
            if len(set(dupdists2)) == len(dupdists2):
                duppriority2 = np.argsort(dupdists2)

etc.

I am stuck here, and this approach also feels inefficient, especially for more than 4 points, and for cases where multiple points can be at a similar distance to one or more nearest, second-nearest, etc. points.
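
One way to sidestep the layer-by-layer tie-breaking entirely is a greedy pass over all pairwise distances: repeatedly take the closest remaining pair and retire both points. This is not guaranteed to minimise the total distance, but it resolves all five examples above without recursion. A minimal sketch (the name greedy_pairs is just illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def greedy_pairs(points):
    n = len(points)
    d = cdist(points, points)                   # full pairwise distance matrix
    i, j = np.triu_indices(n, k=1)              # all unordered pairs
    order = np.argsort(d[i, j], kind='stable')  # closest pairs first; stable on ties
    labels = np.full(n, np.nan)                 # leftover points stay nan
    pair_id = 0
    for k in order:
        a, b = i[k], j[k]
        if np.isnan(labels[a]) and np.isnan(labels[b]):
            labels[a] = labels[b] = pair_id     # retire both points
            pair_id += 1
    return labels

print(greedy_pairs(np.array([[1, 1], [2, 2], [5, 5], [6, 6]])))  # [0. 0. 1. 1.]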

I also found that with scipy there is a similar one-line command that could be used to get the distances and indices:

from scipy.spatial import cKDTree
distances, indices = cKDTree(matrix3).query(matrix3, k=len(matrix3))

so I am wondering which of the two would be better to continue with.

More specific problem that I want to solve

I have a list of points and need to match them optimally to a list of points from a previous point in time. The number of points is limited, ranging from 2 to 10, and is fairly consistent over time (i.e. it won't jump much between values). Data tends to look like:

prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}
currlocs = [(435, 390), (405, 295), (290, 215), (440, 330)]

Points are generally closer to their own previous position than to those of other points. Thus I should be able to link the identities of the points over time. There are, however, a number of complications that need to be overcome (see the sketch after the list below):

  1. sometimes there is no equal number of current and previous points
  2. points often have the same closest neighbour but must not be allocated the same identity
  3. points sometimes have the same distance to their closest neighbour (but very unlikely to the 2nd-, 3rd-nearest neighbours, etc.)
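
Since this is a one-to-one matching between two point sets, scipy.optimize.linear_sum_assignment (the Hungarian algorithm) may be a good fit; recent SciPy versions accept rectangular cost matrices, which covers the unequal-counts case. A minimal sketch using the data above (nan marks an unmatched current point):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}
currlocs = [(435, 390), (405, 295), (290, 215), (440, 330)]

cost = cdist(prevdat['loc'], currlocs)  # distances between previous and current points
row, col = linear_sum_assignment(cost)  # minimises the total matched distance

currids = np.full(len(currlocs), np.nan)          # unmatched current points stay nan
currids[col] = np.array(prevdat['contid'])[row]   # carry identities forward
print(currids)  # [1. 2. 0. nan]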

Any advice to help solve my problem would be much appreciated. I hope my examples and effort above will help. Thanks!

crazjo
  • Although it wasn't completely clear to me what you are trying to do, my initial reaction was: why aren't you using [cKDTree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html)? Once you have the tree there are a number of helper methods that allow you to query neighbourhoods, distances, etc. – DrBwts May 03 '21 at 11:32
  • I read and understood what you did, but not the problem. Are you looking for some global optimization on total sum of distances? If so, this smells like a variation on [matching](https://en.wikipedia.org/wiki/Matching_(graph_theory)). – Gulzar May 03 '21 at 11:44
  • FWIW in my most recent work requiring nearest neighbor search, I seem to recall that `scipy.spatial.cKDTree` was considerably faster than the `sklearn` offerings. But the exact matching problem you're describing sounds quite difficult—might be a variation on the knapsack problem, which for exact solutions is at least [NP-Complete](https://en.m.wikipedia.org/wiki/Knapsack_problem#Computational_complexity). Can you tolerate approximate solutions? – senderle May 03 '21 at 11:46
  • There is an `O(V^2E)=O(n^3)` algorithm (^3 for grid graphs, which can be "somewhat" the case here, or ^4 for the general case): see [maximum weight matching](https://en.wikipedia.org/wiki/Maximum_weight_matching). – Gulzar May 03 '21 at 11:49
  • Yeah interesting that @Gulzar thought of a different NP-Complete problem (at least in the max-min formulation of matching). So it seems likely that this is unavoidably hard if you need exact solutions. There are probably fast(ish) approximation algorithms though. – senderle May 03 '21 at 11:56
  • @senderle actually Wikipedia claims matching is polynomial. What did I miss? – Gulzar May 03 '21 at 11:57
  • @Gulzar I was looking at the max-min formulation of the problem, but yeah you're right, I'm not sure which one this would be equivalent to. (I was in the middle of editing my comment when you replied, sorry.) – senderle May 03 '21 at 12:00
  • @senderle Yes, I can tolerate approximate solutions. As I mentioned in my question, in most cases all current points will be closest to their previous point in time; only in some cases does it get more complicated (see pts 1–3 above), and even then it should affect only one or a few out of all points, which in my case will be a maximum of ~10 anyway. – crazjo May 03 '21 at 12:01
  • In your examples, in all points [x,y] we have x=y, i.e. they are taken from the same line f(x)=x and they are ordered, i.e. x2>x1 and y2>y1. Will all your points have these properties? – Max Pierini May 03 '21 at 15:55
  • @MaxPierini thanks for your question, no, I should maybe have given a more random example. x and y will both vary between 0 and 1000 and the points won't be ordered – crazjo May 03 '21 at 19:16

1 Answer


This can be formulated as a mixed integer linear programming problem.

In Python you can model and solve such problems using cvxpy.
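
Concretely, the code below introduces one 0/1 variable per ordered pair of points and solves (a compact restatement of its constraints):

$$\min \sum_{i,j} d_{ij}\,u_{ij} \quad \text{s.t.} \quad \sum_j u_{ij} + \sum_j u_{ji} \le 1 \;\;\forall i, \qquad \sum_{i,j} u_{ij} \ge \lfloor N/2 \rfloor, \qquad u_{ij} \in \{0,1\}$$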

import numpy as np
import cvxpy

def connect_point_cloud(points):
    '''
    Given a set of points, return the pairs of points
    whose summed distance is minimised.
    '''
    N = points.shape[0]
    I, J = np.indices((N, N))
    d = np.sqrt(sum((points[I, i] - points[J, i])**2 for i in range(points.shape[1])))

    use = cvxpy.Variable((N, N), integer=True)
    # each entry use[i,j] indicates that point i is connected to point j
    # each pair may count 0 or 1 times
    constraints = [use >= 0, use <= 1]
    # point i may be used in at most one connection
    constraints += [sum(use[i, :]) + sum(use[:, i]) <= 1 for i in range(N)]
    # at least floor(N/2) connections must be present
    constraints += [sum(use[i, j] for i in range(N) for j in range(N)) >= N // 2]

    # let the solver handle the problem
    P = cvxpy.Problem(cvxpy.Minimize(sum(use[i, j] * d[i, j] for i in range(N) for j in range(N))), constraints)
    dist = P.solve()  # optimal total distance of the selected pairs
    return use.value
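
A quick sanity check against the first example from the question (this assumes `connect_point_cloud` as defined above and a mixed-integer-capable solver in your cvxpy installation):

# matrix1 from the question; the expected pairing is (0,1) and (2,3)
pairs = connect_point_cloud(np.array([[1, 1], [2, 2], [5, 5], [6, 6]]))
# threshold at 0.5 to guard against round-off in the 0/1 solution values
print(np.transpose(np.nonzero(pairs > 0.5)))  # e.g. [[0 1] [2 3]]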

Here is a piece of code to visualize the result for a 2D problem:

import matplotlib.pyplot as plt

# create a random set of 50 points
p = np.random.rand(50, 2)
# find the pairs with minimum total distance
pairs = connect_point_cloud(p)

# plot all the points with circles
plt.plot(p[:, 0], p[:, 1], 'o')

# plot lines connecting the points
for i1, i2 in zip(*np.nonzero(pairs)):
    plt.plot([p[i1,0], p[i2,0]], [p[i1,1], p[i2,1]])

[Figure: the 50 random points plotted as circles, with line segments connecting each matched pair]

Bob
  • Wow I am impressed, this works extremely well, thank you very much! – crazjo May 04 '21 at 08:23
  • I was wondering if you could help me further with this. I will have a list of previous points with identities, e.g. `prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}`, and a similar list of current points (`curdat`), with the number of ids in prevdat and curdat varying between 0 and 10 (for simplicity now), but mostly only differing by 1 or 2 in length. How can I use your above code to find the pairs such that the contids of `curdat` are optimally linked to the ids of `prevdat`? i.e. locs should not be compared within `curdat` or `prevdat` but between them. Thanks! – crazjo May 04 '21 at 08:34
  • Could you create a follow-up question, please? Try to address the following points: given a solution, what is the cost of the mentioned linking? Do you want to add this cost to the sum of the distances (possibly with a scale), or do you want to choose, among all the configurations with minimum distance, one with minimum link cost? – Bob May 04 '21 at 12:23
  • I am happy to make it into a separate question if you think that is helpful. I am not so familiar with these kinds of problems so don't fully understand your questions. What do you mean with "the cost of the mentioned linking"? I feel with your answer above it is only a couple lines extra/changed to solve my more specific problem. Thanks! – crazjo May 06 '21 at 06:57
  • You wrote "such that the contids of curdat are optimally linked to the ids of prevdat", my interpretation of that is that you may have different linking between `contids of curdat` and `ids of prevdat`, if you want to optimize any quantity you need to be able to measure it. – Bob May 06 '21 at 10:07
  • The main difference to the problem you solved with your excellent answer is that it won't be one dataframe with points but two dataframes with points, with the points being generated by tracking data of moving objects. As the nearest pt in time will 99% of the time be of the same object, I think using your solution but adapted for two dataframes, and allocating ID numbers using a dictionary, is all that is needed, no? But as I am very unfamiliar with the cvxpy package I was hoping you could have some good ideas. Thanks again! – crazjo May 06 '21 at 14:30
  • As I said, create a new question. This question is good as is: the problem is well defined, it is likely to be interesting to other people, and the solution is straightforward (complex under the hood, but straightforward). Also, it would be good if you can provide example data for the new problem. – Bob May 06 '21 at 15:17
  • Thanks, I have now asked the question here: https://stackoverflow.com/questions/67430853/temporal-linking-of-point-identities-based-on-minimal-distances – crazjo May 07 '21 at 07:48
  • as the other question is not being answered, is there maybe a simpler solution: could your function be expanded to take not one array of points but two, with points paired only with points of the other array, not within their own? My knowledge of the `cvxpy` package is too limited to know how to do this. Thanks @bob – crazjo May 19 '21 at 12:06
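
For reference, a minimal sketch of how the answer's model might be adapted to two point sets, pairing only across the sets (the function name connect_two_clouds is illustrative, a mixed-integer-capable solver is assumed, and the inputs are expected as NumPy arrays):

import numpy as np
import cvxpy

def connect_two_clouds(prev_pts, curr_pts):
    '''Pair points of prev_pts with points of curr_pts only (never within
    one set), minimising the total distance of the matched pairs.'''
    m, n = len(prev_pts), len(curr_pts)
    # pairwise distances between the two sets, shape (m, n)
    d = np.sqrt(((prev_pts[:, None, :] - curr_pts[None, :, :]) ** 2).sum(axis=-1))

    use = cvxpy.Variable((m, n), boolean=True)
    # each previous point and each current point may be used at most once
    constraints = [cvxpy.sum(use, axis=1) <= 1, cvxpy.sum(use, axis=0) <= 1]
    # match as many points as the smaller set allows
    constraints += [cvxpy.sum(use) >= min(m, n)]

    cvxpy.Problem(cvxpy.Minimize(cvxpy.sum(cvxpy.multiply(d, use))), constraints).solve()
    return use.value

prev = np.array([(300, 200), (425, 400), (400, 300)])
curr = np.array([(435, 390), (405, 295), (290, 215), (440, 330)])
pairs = connect_two_clouds(prev, curr)
print(np.transpose(np.nonzero(pairs > 0.5)))  # one (prev, curr) index pair per row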