
How do I get the data points and the centroid of a k-means (Lloyd) cluster when I use ELKI?

Also, could I plug those points into one of the distance functions and get the distance between any two of them?

This question is different, because the main focus of my question is retrieving the data points, not clustering custom data points. Also, the answer on the other thread is currently incomplete, since it refers to a wiki that is not functioning at the moment. Additionally, I would like to know specifically what needs to be done: the documentation across the libraries is a bit of a wild goose chase, so if you know or understand the library, a direct answer would give others with the same problem a solid reference to refer to, instead of having to figure the library out themselves.

benj rei
  • While this question mentions DBSCAN, the answer covers accessing the objects. [ELKI: Running DBSCAN on custom Objects in Java](http://stackoverflow.com/questions/30893319/elki-running-dbscan-on-custom-objects-in-java) and so does this one for hierarchical clustering: http://stackoverflow.com/q/17687533/1060350 – Has QUIT--Anony-Mousse Mar 03 '16 at 00:20
  • @Anony-Mousse In the example docs it uses the `getoffset` command and that returns numbers. Are they the data point in relation to their position in the db? How would I go about getting the centroid for each cluster? (Also btw all of the website for the library is down, and I don't think its on my end only). – benj rei Mar 03 '16 at 01:15

1 Answer


A Cluster (JavaDoc) in ELKI never stores the point data. It only stores point DBIDs (Wiki), which you can get using the getIDs() method. To get the original data, you need the Relation from your database. The method getModel() returns the cluster model, which for k-means is a KMeansModel.

You can get the point data from the database Relation by DBID, or compute the distance between two points given their DBIDs.
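If you want to sanity-check a distance value by hand, this is the math that squared Euclidean distance performs on two vectors (once you have pulled them out of the Relation). This is a plain-Java sketch; the class and method names here are my own, not ELKI API:

```java
// Plain-Java stand-in for the squared Euclidean distance between two
// points, as you would compare two vectors fetched from the Relation.
// SqDistDemo and sqDist are illustrative names, not part of ELKI.
public class SqDistDemo {
    public static double sqDist(double[] a, double[] b) {
        double sum = 0.;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d; // squared difference per dimension
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = { 1.0, 2.0 };
        double[] q = { 4.0, 6.0 };
        System.out.println(sqDist(p, q)); // 3*3 + 4*4 = 25.0
    }
}
```

Note that this is the *squared* distance; take the square root if you need the ordinary Euclidean distance.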

The centroid of k-means is special: it is not a database object, but always a numerical vector, the arithmetic mean of the cluster. When using k-means, you should be using SquaredEuclideanDistanceFunction. This is a NumberVectorDistanceFunction, which has the method distance(NumberVector o1, NumberVector o2) (not all distance functions work on number vectors!).

Relation<? extends NumberVector> rel = ...;
NumberVectorDistanceFunction<NumberVector> df = SquaredEuclideanDistanceFunction.STATIC;

// ... run the algorithm, then iterate over each cluster: ...

Cluster<KMeansModel> cluster = ...;
Vector center = cluster.getModel().getMean();
double varsum = cluster.getModel().getVarianceContribution();

double sum = 0.;
// C++-style for loop, for efficiency:
for(DBIDIter id = cluster.getIDs().iter(); id.valid(); id.advance()) {
   double distance = df.distance(rel.get(id), center);
   sum += distance;
}

System.out.println(varsum + " should be the same as " + sum);
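To see why the printed values should agree, here is a self-contained, plain-Java sketch of the identity being checked: the variance contribution of a k-means cluster is the sum of squared Euclidean distances of its members to the cluster mean. The class and method names are mine for illustration, not ELKI's:

```java
// Self-contained check of the "varsum" identity for one cluster:
// sum of squared distances to the mean == variance contribution.
// VarsumDemo, mean and varsum are illustrative names, not ELKI API.
public class VarsumDemo {
    // Arithmetic mean of the points (the k-means centroid).
    public static double[] mean(double[][] pts) {
        double[] m = new double[pts[0].length];
        for (double[] p : pts)
            for (int i = 0; i < m.length; i++) m[i] += p[i];
        for (int i = 0; i < m.length; i++) m[i] /= pts.length;
        return m;
    }

    // Sum of squared Euclidean distances of all points to the mean.
    public static double varsum(double[][] pts) {
        double[] m = mean(pts);
        double sum = 0.;
        for (double[] p : pts)
            for (int i = 0; i < m.length; i++) {
                double d = p[i] - m[i];
                sum += d * d;
            }
        return sum;
    }

    public static void main(String[] args) {
        // Mean is (1,1); every point has squared distance 2 to it.
        double[][] cluster = { {0, 0}, {2, 0}, {0, 2}, {2, 2} };
        System.out.println(varsum(cluster)); // 4 points * 2 = 8.0
    }
}
```

This is exactly the quantity the loop above accumulates, which is why it should match getVarianceContribution() up to floating-point rounding.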
Erich Schubert
  • When you get the mean doesn't it differ by cluster? Maybe I am not fully understanding the code, but it looks like you are only using one centroid for the distance function, when the average distance from the center, first gets the distance each point is from its centroid, and then adds those distances up. Also would Varsum be equal to the summation of each points distance from its cluster's center? – benj rei Mar 03 '16 at 12:06
  • This code snippet processes a *single* cluster (`Cluster != Clustering`); you still need another `for` loop over all clusters. – Erich Schubert Mar 03 '16 at 12:36
  • @ErichSchubert This information is very helpful. I have a problem on DBSCAN. Since the ELKI user mailing list is not in English, I don't know how to raise the question to you. Could you tell me if I can reach you in some way? My problem is that I tried the Apache math3 DBSCANClusterer and got the result I expected, but I don't know how to get the same result by using ELKI's DBSCAN. I could post a stackoverflow question, but really need the expert like you to help on this. We have the data set of hundred of millions -- we are worried these methods may not work. – Paul Z Wu Jan 16 '18 at 01:31