I have a one-dimensional data set whose histogram shows multiple local maxima, so I know that there are multiple regions in my one-dimensional space where the data is denser. I want to determine boundaries for these dense regions that allow me to classify which dense region / cluster a given data point is in. For this I am using OPTICS, because it should be able to deal better with differing densities between the clusters than DBSCAN.

I am using ELKI (version 0.6.0) from Java code (I know the ELKI team recommends against embedding ELKI in Java, but I need to repeat my workflow for many datasets, so it's better to automate this in my case). The code snippet below prints the indices of the start and end items of the clusters. The ELKI documentation on OPTICSModel does not clearly define what these index numbers correspond to, but I assume they are the indices of the start and end data items in the augmented cluster ordering of the database (like the ClusterOrderResult object that OPTICS.run() creates), as opposed to indices into the (unordered) database itself.

ListParameterization opticsParams = new ListParameterization();
opticsParams.addParameter(OPTICSXi.XI_ID, 0.01);
opticsParams.addParameter(OPTICS.MINPTS_ID, 100);
OPTICSXi<DoubleDistance> optics = ClassGenericsUtil.parameterizeOrAbort(OPTICSXi.class, opticsParams);

ArrayAdapterDatabaseConnection arrayAdapterDatabaseConnection = new ArrayAdapterDatabaseConnection(myListOfOneDimensionalFeatureVectors.toArray(new double[myListOfOneDimensionalFeatureVectors.size()][2]));
ListParameterization dbParams = new ListParameterization();
dbParams.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
dbParams.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
dbParams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, arrayAdapterDatabaseConnection);

Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbParams);
db.initialize();

Clustering<OPTICSModel> result = optics.run(db);
List<Cluster<OPTICSModel>> clusters = result.getAllClusters();
for (Cluster<OPTICSModel> cluster : clusters) {
    // Skip noise; print the start and end index of each cluster
    if (!cluster.isNoise()) {
        System.out.println(cluster.getModel().getStartIndex() + ", " + cluster.getModel().getEndIndex() + ";  ");
    }
}

Now I want to know where in my one-dimensional space my clusters start and end. Therefore I would like to retrieve the data items corresponding to the start and end indices that my code above already obtains. I assume that I would need a ClusterOrderResult object for that, in which I could then look up the obtained indices. From the documentation, however, it seems that it is not possible to retrieve such an object from the Clustering result object that I obtained by calling optics.run(). As there seemed to be no way of obtaining this ordered database, I naively tried looking up the indices in my original input dataset instead, by replacing the println in the code above with the println below:

System.out.println(myListOfOneDimensionalFeatureVectors.get(cluster.getModel().getStartIndex())[0] + ", " + myListOfOneDimensionalFeatureVectors.get(cluster.getModel().getEndIndex())[0] + ";  ");

As I already expected, however, the indices do not seem to refer to the original input file, as this regularly prints end boundaries with lower values in my one-dimensional space than the start boundaries. Does anybody know a way to obtain the original one-dimensional data values that correspond to the start and end indices found with OPTICS clustering? I want to use these values later in my code.

Has QUIT--Anony-Mousse
Niek Tax

1 Answer

For automation, calling ELKI from the command line works very well. That is my preferred way, because each run is then nicely isolated in its own JVM.

You would then have easy access to this data from the output files.

Why are you using an old version of ELKI? The 0.6.5 versions are much nicer because the generics were removed, although I have since switched to the GitHub version.

If you want direct access to the ClusterOrder object, it's attached to the clustering object as a child result. You should be able to get it using

ClusterOrder clusterOrder = ResultUtil.filterResults(clustering, ClusterOrder.class).get(0);

and its object ids via:

ArrayDBIDs ids = DBIDUtil.ensureArray(clusterOrder.getDBIDs());

(The ensureArray call is overhead, but it is effectively a no-op here: it is a cast-or-convert operation, and here it will be a cast; at least in my ELKI version the ids are always stored as ArrayDBIDs.)

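Array iterators (DBIDArrayIter it = ids.iter()) can be moved to a position via seek(offset). So you should be able to use something like the following, where relation is the Relation that holds your vectors inside the database (not the Database object itself):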

DBIDArrayIter it = ids.iter();
NumberVector vec = relation.get(it.seek(model.getStartIndex()));

The iterators in ELKI are odd for Java APIs, but very fast if you use a single iterator for all your accesses.
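
To make the index mapping concrete without depending on ELKI, here is a minimal plain-Java sketch. It assumes the cluster order is available as an array of positions into the original data (the hypothetical `order` array stands in for the ids of the cluster order) and that the end index is inclusive; both are assumptions for illustration, not ELKI's API. It also shows why the start item can have a larger value than the end item: the cluster order is not sorted by value, so the value range of a cluster is the minimum and maximum over all items between the two indices.

```java
import java.util.Arrays;

final class ClusterOrderBounds {
    /**
     * Returns {min, max} of the data values found at cluster-order
     * positions start..end (end assumed inclusive).
     * order[pos] is the index into data of the item at position pos.
     */
    static double[] valueRange(double[] data, int[] order, int start, int end) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (int pos = start; pos <= end; pos++) {
            double v = data[order[pos]];
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] { min, max };
    }

    public static void main(String[] args) {
        // Hypothetical toy data and cluster order, for illustration only
        double[] data = { 5.0, 1.0, 1.2, 5.1, 0.9, 4.8 };
        int[] order = { 1, 2, 4, 0, 3, 5 };
        // Positions 0..2 hold the values 1.0, 1.2, 0.9 (end item < start item)
        System.out.println(Arrays.toString(valueRange(data, order, 0, 2))); // [0.9, 1.2]
        System.out.println(Arrays.toString(valueRange(data, order, 3, 5))); // [4.8, 5.1]
    }
}
```

With ELKI, the same loop would walk the array iterator from the start index to the end index and read each vector from the relation.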

So much for the ELKI part of your question. However, from a statistical point of view it does not make sense to use OPTICS on one-dimensional data. On one-dimensional data, use proper kernel density estimation instead. OPTICS is a rough and crude method that makes sense when your data is too complicated to model using proper statistical tools. OPTICS uses a very primitive kernel density, and the xi method is a very naive extraction of clusters from the density plot; at least on one-dimensional data, statistics offers stronger tools. ELKI has an implementation called KNNKernelDensityMinimaClustering, which I have not used yet, but I would give this class a try. Kernel density estimation should also be available in any statistical toolkit.
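
As a rough illustration of the kernel density idea, the sketch below estimates a density with a fixed-bandwidth Gaussian kernel on a regular grid and reports the grid points where the density has a local minimum, which can serve as cluster boundaries. This is a generic sketch and not the algorithm used by KNNKernelDensityMinimaClustering; the class name, the bandwidth `h`, and the grid parameters are all hypothetical choices that would need tuning for real data.

```java
import java.util.ArrayList;
import java.util.List;

final class KdeMinimaSketch {
    /** Gaussian kernel density estimate at x with fixed bandwidth h. */
    static double density(double[] data, double x, double h) {
        double sum = 0.0;
        for (double d : data) {
            double u = (x - d) / h;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (data.length * h * Math.sqrt(2 * Math.PI));
    }

    /** Grid points in (lo, hi) where the estimated density has a strict local minimum. */
    static List<Double> boundaries(double[] data, double h, double lo, double hi, int steps) {
        double width = (hi - lo) / steps;
        double[] dens = new double[steps + 1];
        for (int i = 0; i <= steps; i++) {
            dens[i] = density(data, lo + i * width, h);
        }
        List<Double> cuts = new ArrayList<>();
        for (int i = 1; i < steps; i++) {
            // Lower than both neighbours: a valley between two dense regions
            if (dens[i] < dens[i - 1] && dens[i] < dens[i + 1]) {
                cuts.add(lo + i * width);
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        // Hypothetical toy data with two dense regions around 1 and 5
        double[] data = { 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1 };
        System.out.println(boundaries(data, 0.5, 0.0, 6.0, 600));
    }
}
```

A data point is then assigned to a cluster by checking between which two neighbouring boundaries its value falls. A smaller bandwidth produces more minima and therefore more clusters, so the bandwidth plays a role similar to the density parameters in OPTICS.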

Has QUIT--Anony-Mousse
  • Thank you, both for the answer and your feedback on the statistical part! I had actually been trying out KDE techniques before trying OPTICS. The problem was that after finding a KDE that fits the data well, it is still far from trivial to obtain clusters from it. Good to hear that KNNKernelDensityMinimaClustering might offer me some possibilities in this direction. Do you happen to know if there is info on how KNNKernelDensityMinimaClustering calculates the clusters from the kernel densities? The Javadoc of the class and the publication list on the ELKI website do not seem to provide info on this. – Niek Tax Apr 19 '15 at 18:06
  • No idea. Usually they have a reference in the JavaDoc, but this class doesn't. – Has QUIT--Anony-Mousse Apr 19 '15 at 18:34
  • I switched to the GitHub version of ELKI as well. One more question on the code snippets that you provided, though: what is the type of the variable 'relation' in your example? At first I thought that it would be the Database, but it.seek returns an object of type DBIDArrayIter, and putting the Database there results in the error "get(DBIDArrayIter) undefined for type Database". – Niek Tax Apr 20 '15 at 08:30
  • I fixed it myself. I found out that I could extract a Collection from the Database object with the method getRelations(). – Niek Tax Apr 20 '15 at 09:43
  • Sorry, `KNNKernelDensityMinimaClustering` is based on my own considerations, I could not find a backing reference to it; yet the method is too simple to be published anywhere. Please cite ELKI for this method, if you use it. – Erich Schubert Apr 27 '15 at 11:01
  • Also, I will modify `ClusterOrder.getDBIDs()` so you do not need the `ensureArray` call anymore. Thanks for this feedback. – Erich Schubert Apr 27 '15 at 11:34