3

I'm currently searching for an efficient algorithm that takes in a set of points from three-dimensional space and groups them into classes (maybe represented by a list). A point should belong to a class if it is close to one or more other points from the class. Two classes are then the same if they share any point. Because I'm working with large data sets, I don't want to use recursive methods. Also, something like a distance matrix with O(n^2) performance is what I'm trying to avoid.

I tried to check some algorithms online, but most of them don't fit this specific purpose (e.g. k-d trees or other clustering algorithms). I thought about partitioning space into smaller cells, but that (potentially) results in an inexact result.
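For reference, the grouping I have in mind can be pinned down by a naive quadratic-time implementation, exactly what I want to avoid for large n, but it defines what a correct result looks like (the union-find helper here is just for illustration, not part of my actual attempt):

```python
def naive_groups(points, d):
    """O(n^2) reference grouping: transitively merge points closer than d."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    d2 = d * d
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # compare squared distances to avoid the square root
            if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) < d2:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())
```

With five points where two pairs are close and one point is isolated, this yields three groups, including the transitive merging described above.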

I tried to write something myself, but it turned out to be flawed. I would sort my points by distance from the origin, append that distance as a fourth coordinate, and then repeat the following code segment:

def grouping_presorted(lst, distance):
    positions = [0]  # stack of indices forming the current chain of neighbors
    x = []
    while positions:
        nn_i = HasNeighbor(lst, distance, positions[-1])

        if nn_i is None:
            x.append(lst.pop(positions[-1]))
            positions.pop(-1)
        else:
            positions.append(nn_i)
    return x

def HasNeighbor(lst, distance, index):
    i = index + 1
    # the bounds check guards against running off the end of the list
    while i < len(lst) and lst[i][3] - lst[index][3] < distance:
        dist_sq = (lst[i][0]-lst[index][0])**2 + (lst[i][1]-lst[index][1])**2 + (lst[i][2]-lst[index][2])**2
        if dist_sq < distance**2:  # compare squared distances consistently
            return i
        i += 1
    return None

Aside from an (easy-to-fix) index-out-of-range error, there's a bigger flaw in the logic of linking the points. If you think of my points as describing lines in space, the algorithm only works for lines that point strictly outward from the origin, but not for circles or similar structures.

Does anybody know of a prewritten code for this or have an idea what I could try?

Thanks in advance.

Edit: It seems my spelling and maybe confusion of some terms have sparked some misunderstandings. I hope that this (badly made) sketch helps. In this example, I marked my reference distance as d and circled in red the two containers I want to end up with. [sketch]

Banana
  • You probably didn't understand spatial algorithms like `kd-tree`s well, because that's how you manage point clouds. – user1767754 Dec 26 '17 at 06:48
  • The problem is not well defined. "Two points are supposed to end up in the same class when their distance is smaller than some fixed value." Does that mean that classes can overlap? – timgeb Dec 26 '17 at 06:49
  • Also, why not use bog-standard k-means with the two adjustments that a centroid must be at the same coordinates as an existing point and that members of a class must lie within a set maximum distance from the centroid? – timgeb Dec 26 '17 at 06:50
  • In addition, if I understand you correctly and you want a guaranteed optimal result (e.g. no random starting classes), there is no efficient algorithm (unless P = NP). – timgeb Dec 26 '17 at 06:52
  • @user1767754 You're right, I didn't. From what I read, I would end up with hierarchies of classes but that's not what I want. – Banana Dec 26 '17 at 06:53
  • @timgeb I don't understand your point; this problem is mathematically well-defined as an equivalence relation. Two points share the same equivalence class if their distance is smaller than d. That means each point belongs to exactly one class. So no, classes can't overlap, because then all of their points would belong to the same class. Note that I did not say they would **only** belong to the same class if this relation holds. They could both be close to a third point and therefore lie in the same class. – Banana Dec 26 '17 at 06:54
  • @timgeb I just don't want any outcome mistakes (stuff like: "once in 1000 runs I statistically end up with two points not in the same class although they are in proximity of each other" should not happen) – Banana Dec 26 '17 at 06:56
  • @Banana consider the three points A=(-1,0,0), B=(0,0,0) and C=(1,0,0) with the maximum euclidean distance 1. Given your constraints "Two points are supposed to end up in the same class when their distance is smaller than some fixed value." we will end up with the overlapping classes {A,B} and {B,C}. – timgeb Dec 26 '17 at 06:57
  • @timgeb No. Given "Two points are supposed to be in the same class when their distance is smaller than 1", A and B are in the same class and B and C are in the same class, so the class is {A,B,C}. Edit: I never said a class only contains two points. – Banana Dec 26 '17 at 06:58
  • @Banana but A and C are further apart than the maximum distance, yet supposed to be in the same class. As I said, your problem statement is not well defined. Therefore this question needs to be closed as unclear. – timgeb Dec 26 '17 at 07:00
  • @timgeb I think you should read up on the definition of an equivalence class. As I mentioned in an earlier comment: two points not satisfying the requirement doesn't mean they don't share the same class. They can be equivalent by transitivity. – Banana Dec 26 '17 at 07:01
  • "I think you should read up the definitions for an equivalence class." I know what an equivalence class is. It is *your* duty to post a self contained question if you expect any help. Posting a garbage problem statement and then demanding the people who are supposed to help you to read up on what you *actually* mean is not how this site works. – timgeb Dec 26 '17 at 07:02
  • @timgeb I tried my best to clarify all uncertainties. If you have any constructive suggestion on how to reformulate the problem, I'm happy to do so. Even if I phrased it wrong, so far all you have said is "It's not well defined that way". Edit: I thought it over and you're right, it does not satisfy the definition of an equivalence relation. I'll try again differently: I want to throw coordinate points together into containers. Given a container, I want to throw an element in there if it is close to one or more of the current elements. – Banana Dec 26 '17 at 07:08
  • @timgeb Why are you taking out your frustrations on this guy? His problem was very clearly defined. He said, "Two points are supposed to end up in the same class when their distance is smaller than some fixed value." This in no way implies that the distance between **any** two points in a given class is less than some fixed value. It only implies that for every point in a class, there is **at least one** other point to which its distance is less than some fixed value. – ubadub Dec 26 '17 at 07:09
  • @Banana alright, please edit that info into your question so that readers don't have to follow the string of comments to know what's going on. Second, I can't help you reformulate the problem because I still think the problem statement as it stands is incomplete, since there's no rule against overlapping containers in it. – timgeb Dec 26 '17 at 07:10
  • @timgeb Are you being intentionally obtuse? There cannot be overlapping containers. – ubadub Dec 26 '17 at 07:11
  • @ubadub I'm actually trying to get the problem statement well defined. Since there seem to be people who think it already is, I'll just remove my downvote and depart. – timgeb Dec 26 '17 at 07:12
  • Possible duplicate of [Unsupervised clustering with unknown number of clusters](https://stackoverflow.com/questions/10136470/unsupervised-clustering-with-unknown-number-of-clusters) Also have a look at this: https://en.wikipedia.org/wiki/Single-linkage_clustering – moooeeeep Dec 29 '17 at 15:49
  • @moooeeeep Thanks for the reference, that indeed looks pretty similar. However, I already posted my final code as an answer (and at least the graph-theoretical one has not been mentioned in any of the other posts). Do you know how hierarchical clustering compares to DBSCAN in terms of runtime/complexity? – Banana Dec 29 '17 at 16:15
  • I believe it's `n^2` vs `n log n`. If you care about actual performance you should probably measure and compare for yourself. – moooeeeep Dec 29 '17 at 18:20

3 Answers

3

What I ended up doing

After following all the suggestions in the comments, help from cs.stackexchange and some research, I was able to write down two different methods for solving this problem. In case someone might be interested, I decided to share them here. Again, the problem is to write a program that takes in a set of coordinate tuples and groups them into clusters. Two points x, y belong to the same cluster if there is a sequence of elements x = x_1, ..., x_N = y such that d(x_i, x_{i+1}) < r for all i.


DBSCAN: Fix the euclidean metric, minPts = 2 and grouping distance eps = r. scikit-learn provides a nice implementation of this algorithm. A minimal code snippet for the task would be:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs #formerly sklearn.datasets.samples_generator
import networkx as nx       #used by the graph-based method below
import scipy.spatial as sp  #used by the graph-based method below

def cluster(data, epsilon, minpts): #DBSCAN, euclidean distance
    db     = DBSCAN(eps=epsilon, min_samples=minpts).fit(data)
    labels = db.labels_ #labels of the found clusters, -1 marks noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #number of clusters
    clusters   = [data[labels == i] for i in range(n_clusters)] #list of clusters
    return clusters, n_clusters

N       = 20000 #number of sample points
epsilon = 0.1   #grouping distance
centers = [[1, 1, 1], [-1, -1, 1], [1, -1, 1]]
X, _ = make_blobs(n_samples=N, centers=centers, cluster_std=0.4,
                  random_state=0)
cluster(X, epsilon, 2) #minPts = 2, as described above

On my machine, this clustering variant with N = 20000 and epsilon = 0.1 takes just 290 ms, so it seems really quick to me.


Graph components: One can think of this problem as follows: the coordinates define the nodes of a graph, and two nodes are adjacent if their distance is smaller than epsilon/r. A cluster is then given by a connected component of this graph. At first I had problems implementing this graph, but there are many ways to do it in linear time. The easiest and fastest way for me, however, was to use scipy.spatial's cKDTree data structure and its query_pairs() method, which returns the index pairs of points that are within a given distance of each other. One could, for example, write it like this:

class IGraph:
    def __init__(self, nodelst=None, radius=1):
        self.igraph  = nx.Graph()
        self.radii   = radius
        self.nodelst = nodelst #numpy array of coordinate tuples; the graph stores indices as nodes
        self.__make_edges__()

    def __make_edges__(self):
        self.igraph.add_edges_from( sp.cKDTree(self.nodelst).query_pairs(r=self.radii) )

    def get_conn_comp(self):
        #keep only components with more than one point (matching minPts = 2 above)
        ind = [list(x) for x in nx.connected_components(self.igraph) if len(x) > 1]
        return [self.nodelst[indlist] for indlist in ind]


def graph_cluster(data, epsilon):
    graph = IGraph(nodelst = data, radius = epsilon)
    clusters = graph.get_conn_comp() 
    return clusters, len(clusters)

For the same dataset mentioned above, this method takes 420 ms to find the connected components. For smaller datasets, e.g. N = 700, this snippet runs faster, however. It also seems to have an advantage at finding smaller clusters (that is, for smaller epsilon values) and a big disadvantage in the other direction (all on this specific dataset, of course). I think that, depending on the given situation, both methods are worth considering.

Hope this is of use for somebody.

Edit: Theoretically, DBSCAN has computational complexity O(n log n) when properly implemented (according to Wikipedia...), while constructing the graph and finding its connected components both run in linear time. I'm not sure how well these statements hold for the given implementations, though.
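As a variation on the graph method, one can avoid networkx entirely by feeding the cKDTree pairs into scipy's sparse connected-components routine. This is not part of my original code, just a sketch; note that, unlike the len(x) > 1 filter above, it also returns single-point components:

```python
import numpy as np
import scipy.spatial as sp
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def csgraph_cluster(data, epsilon):
    data = np.asarray(data)
    n    = len(data)
    # edges of the proximity graph: all index pairs closer than epsilon
    pairs = np.array(list(sp.cKDTree(data).query_pairs(r=epsilon)))
    if len(pairs) == 0:  # no two points within epsilon: every point is its own cluster
        return [data[i:i + 1] for i in range(n)], n
    # sparse adjacency matrix; directed=False symmetrizes it for us
    adj = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n))
    n_comp, labels = connected_components(adj, directed=False)
    return [data[labels == i] for i in range(n_comp)], n_comp
```

For two close pairs plus one isolated point, this returns three components, the singleton included.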

Banana
2

You could try https://en.wikipedia.org/wiki/OPTICS_algorithm. If you index the points first (e.g. with an R-tree), this should be possible in O(n log n).
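For what it's worth, newer versions of scikit-learn (0.21+) ship an OPTICS implementation, so a minimal sketch could look as follows; the sample points and parameter choices (minimum cluster membership of 2, radius 1.0, mirroring the question's setup) are my assumptions, not part of the original suggestion:

```python
import numpy as np
from sklearn.cluster import OPTICS

# two close pairs and one isolated point (hypothetical sample data)
X = np.array([[0, 0, 0], [0.5, 0, 0], [10, 0, 0], [10.4, 0, 0], [50, 50, 50]])

# extract DBSCAN-style flat clusters from the OPTICS ordering at radius eps = 1.0
clust  = OPTICS(min_samples=2, max_eps=1.0, cluster_method='dbscan', eps=1.0).fit(X)
labels = clust.labels_  # -1 marks points not within eps of any cluster
```

Here the two pairs end up in two clusters and the isolated point is labeled -1 (noise).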

Edit:

If you already know your epsilon and the minimum number of points in a cluster (minpoints), then DBSCAN could be the better choice.

SaiBot
  • with epsilon = d and N=1 this algorithm seems to do exactly what I want. I don't know if the runtime is achievable with these parameters though, but I will have a closer look at this. Thank you! – Banana Dec 27 '17 at 04:34
  • regarding your edit: yes through your OPTICS suggestion I already found DBSCAN and there is another possible solution that I have in mind now. I will try to implement all and tune back in to compare. Thank you very much! – Banana Dec 27 '17 at 11:17
  • Glad that I could help. Regarding the runtime: indexing 3d points with an R-tree is possible in O(n log n). Now you basically want to perform an epsilon range query (which is in O(log n) if the epsilon is not too large) for each point, so you should get away with O(n log n). – SaiBot Dec 27 '17 at 11:36
0

Adapt a path-finding algorithm, such as Dijkstra's or A*, or alternatively adapt breadth-first or depth-first search of a graph. Start at any point in the set of unvisited points, and proceed with whichever algorithm you've picked, with the caveat that a point is considered connected only to the points to which its distance is less than the threshold. When you've finished with one class (i.e. when you can discover no more new nodes), pick any node from the set of unvisited nodes and repeat.
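One way to sketch this BFS variant (my own elaboration, not part of the answer) is to let a k-d tree, here scipy's cKDTree, answer the "which points are within the threshold?" queries, so the adjacency never has to be materialized:

```python
from collections import deque

import numpy as np
from scipy.spatial import cKDTree

def bfs_clusters(points, threshold):
    points = np.asarray(points)
    tree = cKDTree(points)  # spatial index: range queries instead of O(n^2) pairwise checks
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = unvisited.pop()            # seed a new class with any unvisited point
        queue, comp = deque([start]), [start]
        while queue:
            node = queue.popleft()
            # neighbors = all points within the threshold of the current point
            for nb in tree.query_ball_point(points[node], r=threshold):
                if nb in unvisited:
                    unvisited.remove(nb)
                    queue.append(nb)
                    comp.append(nb)
        clusters.append(comp)              # indices of one finished class
    return clusters
```

Each inner BFS exhausts one class before the outer loop picks a fresh unvisited seed, matching the procedure described above.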

ubadub
  • Thank you for the hint. But does establishing the graph, i.e. checking that a point is "connected only to the points to which its distance is less than the threshold", not require me to do on the order of N^2 such checks? That is why I initially thought I could restrict myself to only checking subsets by ordering the list beforehand (which I can do during the creation of my points). – Banana Dec 26 '17 at 07:56
  • @Banana How are your points stored? Would it be feasible to store them in a graph to begin with? – ubadub Dec 26 '17 at 08:00
  • @Banana You said you wanted to avoid a distance matrix, but could you set up a matrix of boolean values which is precomputed before you run this algorithm, where a value of true indicates that two points have a distance no greater than the threshold (and false otherwise)? Also, are your point coordinates integers or floating points? – ubadub Dec 26 '17 at 08:05
  • I obtain the points from a function (which is a black box to me) in a for loop, in the form of [x,y,z] lists. At the moment I insert them into a list such that the list stays sorted. Storing them in a graph might be possible, but I'm not sure how to go about this efficiently. – Banana Dec 26 '17 at 08:05
  • @Banana does the function tell you what the maximum coordinates are? i.e. are the coordinates bounded in any way? – ubadub Dec 26 '17 at 08:06
  • My coordinates are floating points and unfortunately created at runtime, and they also depend on some random coefficients :/ Precomputation seems rather unlikely to be possible. – Banana Dec 26 '17 at 08:06
  • Yes all points are in a cube of fixed size for a given run through my problem. They come as the intersection of preimages of two functions F(x,y,z)=c1 and G(x,y,z)=c2 that contain random variables. However [x,y,z] are restricted to the cube. – Banana Dec 26 '17 at 08:08
  • @Banana I went through a couple of texts on algorithms and I can't see how you could do this better than O(n^2). I would suggest asking https://cs.stackexchange.com/ – ubadub Dec 26 '17 at 08:15