
So, you have a set of "explored" points in space, as well as a set of "unexplored" points. You want to pick K unexplored points to explore so that the mean distance from each unexplored point to its closest explored point is minimized.

[Sketch of the problem]

Could this be done more efficiently than by brute force, i.e. trying the unexplored points one by one and measuring the resulting mean distance?

I have the Python function below that gets the job done, but it is not feasible for large sets because it becomes very slow. I want to use this on sets of at least hundreds of thousands of unexplored points, so it needs to be much more efficient. I do not need an optimal solution; a good approximation would do!

Could this somehow be done without the nested for-loops?

Or could the evaluation somehow be limited to only the most promising candidate points?

All ideas will be highly appreciated!

import numpy as np

explored = np.random.rand(100,3)
unexplored = np.random.rand(100000,3)

def k_anchors(explored, unexplored, K):
    # Greedily pick K "anchor" points, one at a time. In every round each
    # unexplored point is tried as a candidate anchor, and the candidate that
    # minimizes the summed Manhattan distance from all unexplored points to
    # their nearest explored point is kept.
    anchors = np.empty((K, unexplored.shape[1]))

    for j in range(K):
        proximity_sum = np.zeros((len(unexplored),))

        for k in range(len(unexplored)):
            # Tentatively add candidate k to the explored set.
            temp_results = np.concatenate(( explored, unexplored[k].reshape((-1,3)) ))
            proximity = np.zeros((len( unexplored ),))

            for i in range(len( unexplored )):
                # Manhattan distance from point i to every explored point,
                # including the tentative candidate.
                i_prox = (abs((unexplored[i,:] - temp_results))).sum(axis=1)
                proximity[i] = i_prox.min()

            proximity_sum[k] = proximity.sum()

        # Keep the best candidate, add it to the explored set (before deleting
        # it from the unexplored set, so the index is still valid) and repeat.
        idx = np.argmin( proximity_sum )
        anchors[j,:] = unexplored[ idx ]
        explored = np.concatenate(( explored, unexplored[ idx ].reshape((-1,3)) ))
        unexplored = np.delete(unexplored, idx, 0)

    return anchors

print( k_anchors(explored, unexplored, 5) )
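
For reference, here is a minimal sketch of how the quantity being minimized (the mean Manhattan distance from each unexplored point to its nearest explored point) could be evaluated without the two inner loops, assuming SciPy is available; the helper name mean_nearest_distance is illustrative only:

import numpy as np
from scipy.spatial import cKDTree

def mean_nearest_distance(explored, unexplored):
    # Build a KD-tree over the explored points once, then query the nearest
    # explored neighbour of every unexplored point in one vectorized call.
    tree = cKDTree(explored)
    dists, _ = tree.query(unexplored, k=1, p=1)  # p=1 selects the Manhattan metric
    return dists.mean()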

SOLUTION

The problem was solved with a variation of the K-means algorithm, as proposed by Barış Can Tayiz, and it worked like a charm.

In short, I initialized the explored points as centroids, along with K random points. Only the K random points were then varied when fitting the data. For me, the number K did not need optimizing, since every time the function is called I know how many points I will be able to explore.
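
A rough sketch of this idea (not the exact code used; the function name k_anchors_kmeans, the Euclidean assignment step, and the fixed iteration count are assumptions for illustration):

import numpy as np

def k_anchors_kmeans(explored, unexplored, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start the K movable centroids at random unexplored points.
    movable = unexplored[rng.choice(len(unexplored), K, replace=False)].copy()

    for _ in range(n_iter):
        centroids = np.vstack((explored, movable))  # frozen + movable centroids
        # Assign every unexplored point to its nearest centroid (Euclidean here);
        # for very large sets this step could be chunked or use a KD-tree.
        d = np.linalg.norm(unexplored[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)

        # Update only the movable centroids (indices >= len(explored));
        # the explored points stay fixed.
        for j in range(K):
            members = unexplored[labels == len(explored) + j]
            if len(members):
                movable[j] = members.mean(axis=0)

    return movable

Calling something like k_anchors_kmeans(explored, unexplored, 5) would then return the K suggested points to explore.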

Thanks to everyone who took precious time to discuss and answer this question!

VIctor
  • https://stackoverflow.com/questions/42806059/fitting-largest-circle-in-free-area-in-image-with-distributed-particle/42807721 – Matt Timmermans Mar 12 '20 at 14:29
    By "mean distance" do you mean the arithmetic average of the distances of the farthest and nearest point? (For example, the mean distance between the sun and the earth is the average of two distances: the aphelion and the perihelion.) Or do you mean the distance of the centroid or "center of mass" of the set of unexplored points? – Adrian McCarthy Mar 12 '20 at 15:51
  • Sorry for being vague. By distance, I mean the distance from any unexplored point to its closest explored point. And then by mean distance, I mean the mean of those distances over all unexplored points. So if point a is 3 m from an explored point and point b is 5 m from an explored point, then the mean distance (as I’m using the term) would be 4 m. – VIctor Mar 12 '20 at 16:19
  • Does this answer your question? [Fitting largest circle in free area in image with distributed particle](https://stackoverflow.com/questions/42806059/fitting-largest-circle-in-free-area-in-image-with-distributed-particle) – AMC Mar 12 '20 at 20:20
  • Matt Timmermans and AMC: To me this seems to provide a point that is as far as possible from other points. Whereas I want the point that minimizes the distance for as many other points as possible. I should be the first to say that I might be missing the point though. If I have misunderstood the contents of that discussion, I would be most grateful if you would clarify how I could use it. – VIctor Mar 13 '20 at 07:24

1 Answer


You can use unsupervised learning algorithms for this purpose. For example, if you select k = 3 for k-means, the points closest to the resulting centers should be explored. Selecting k is another problem; you can read about it in this article: https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb. For the Within-Cluster Sum of Squared Errors (WSS), you can use the ratio (WSS_{n+1} - WSS_n) / (WSS_n - WSS_{n-1}); tracking this ratio while measuring the WSS for each k will indicate the best k.
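
A minimal sketch of that ratio, assuming scikit-learn is available (the helper name wss_ratio_curve and the idea of scanning a range of consecutive k values are illustrative, not part of the original answer):

from sklearn.cluster import KMeans

def wss_ratio_curve(points, k_values):
    # k_values must be consecutive integers, e.g. range(2, 11).
    # Within-Cluster Sum of Squared Errors (WSS) for each k, via KMeans inertia.
    wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
           for k in k_values}
    # Ratio (WSS_{k+1} - WSS_k) / (WSS_k - WSS_{k-1}) for the interior values of k;
    # a pronounced change in this ratio suggests a good choice of k.
    return {k: (wss[k + 1] - wss[k]) / (wss[k] - wss[k - 1])
            for k in list(k_values)[1:-1]}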

  • I certainly like this answer! It should be simple enough to implement, although I am not sure how to treat the already explored points. I’ll try to write something up and see if it works out. Maybe the explored points can be treated as static centroids. – VIctor Mar 12 '20 at 19:39