
I'm having some trouble with Weka's XMeans clusterer. I've talked to a couple of other humans and we all agree that there are six clusters in the screenshot below, or at least a minimum of two if you squint really hard. Either way, XMeans does not seem to agree.

[Screenshot: six clusters, one center point]

XMeans seems to systematically underestimate the number of clusters, based on whatever I have the minimum cluster count set to. With the maximum number of clusters held at 100, here are the results I get:

-L 1 // 1 cluster
-L 2 // 2 clusters
-L 3 // 3 clusters
-L 4 // 5 clusters
-L 5 // 6 clusters
-L 6 // 6 clusters

Most egregiously, with -L 1 (and -H 100) only a single cluster is found. Only by raising the minimum cluster count to five do I actually see the six clusters. Cranking the improve-structure parameter way up (to 100,000) does not seem to have any effect. (I've also played with other options without seeing any difference.) Here are the options that generated the above screenshot, which found one center:

private static final String[] XMEANS_OPTIONS = {
    "-H", "100",         // max number of clusters (default 4)
    "-L", "1",           // min number of clusters (default 2)
    "-I", "100",         // max overall iterations (default 1)
    "-M", "1000000",     // max improve-structure iterations (default 1000)
    "-J", "1000000",     // max improve-parameter iterations (default 1000) 
};

Obviously I'm missing something here. How do I make XMeans behave as expected?

Sasgorilla
  • Try X-means in ELKI. Does it work better? Then maybe there is a bug... but in my experience, X-means frequently gets the number of clusters wrong, even on easy synthetic data like this – Has QUIT--Anony-Mousse Feb 22 '16 at 20:28
  • Can you share the data, e.g., on drop box? I have a hypothesis what is going wrong. – CAFEBABE Feb 23 '16 at 22:24
  • @CAFEBABE - the data is [here](https://www.dropbox.com/s/739sg4vkvkt51jr/six_clusters.dat?dl=0), in space-separated format. Would love to get your thoughts. – Sasgorilla Feb 24 '16 at 14:30

1 Answer


I think it is what I was afraid of: a perceptual problem (if you search for Gestalt theory you might find some explanation of the perception aspect).

The human eye grasps the shape of the clusters and finds six circles. However, k-means, and therefore X-means, only looks at distances; hence the clusters can look rather awkward. Even after multiple restarts with 6-means I almost always got clusterings like this:

[Figure: counter-intuitive 6-means assignment]

This one might be a local minimum which might be resolved by X-means,

or, for 3-means:

[Figure: 3-means assignment]

which is quite interesting, as some points are clearly assigned contrary to expectations.

If you use k-means in R, you can analyse the `withinss` component of the clustering result. It shows that these awkward-looking clusterings typically score quite well, so there is no convergence to a different result to be expected.
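To make the point concrete, here is a minimal pure-Python sketch of the within-cluster sum of squares (the quantity R's `kmeans()` reports as `withinss`); the points and assignments below are invented for illustration, not taken from the question's data:

```python
# Pure-Python sketch of the within-cluster sum of squares (the quantity
# R's kmeans() reports as `withinss`). The points and assignments are
# invented for illustration; 2-D points only.

def withinss(points, labels):
    """Total squared Euclidean distance of each point to its cluster centroid."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for pts in clusters.values():
        cx = sum(x for x, _ in pts) / len(pts)
        cy = sum(y for _, y in pts) / len(pts)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts)
    return total

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
human = [0, 0, 0, 1, 1, 1]   # the "obvious" two clusters
odd = [0, 0, 1, 0, 1, 1]     # a counter-intuitive assignment
print(withinss(pts, human))  # much lower than the odd assignment
print(withinss(pts, odd))
```

On well-separated toy data the intuitive assignment does win by this score; the answer's point is that on the question's data the awkward-looking assignments score comparably well.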

I think this could be resolved by using a different distance measure, for example the squared Euclidean distance, which enforces circle-like shapes, or by using some kernel-based clustering technique with an RBF kernel.
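As a rough illustration of the kernel-based suggestion, here is a minimal kernel k-means sketch in pure Python with an RBF kernel. This is not Weka's implementation; the sample points, the `gamma` value, and the deterministic initialisation are all invented for the demo:

```python
import math

# Sketch of kernel k-means with an RBF kernel, in pure Python.
# The distance to a cluster "centre" is computed in the implicit
# feature space via the kernel trick:
#   ||phi(x) - mu_C||^2 = K(x,x) - (2/|C|) * sum_{j in C} K(x,j)
#                          + (1/|C|^2) * sum_{j,l in C} K(j,l)

def rbf(a, b, gamma=0.5):
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_kmeans(points, k, iters=20):
    n = len(points)
    K = [[rbf(points[i], points[j]) for j in range(n)] for i in range(n)]
    labels = [i % k for i in range(n)]  # simple deterministic initialisation
    for _ in range(iters):
        members = {c: [i for i in range(n) if labels[i] == c] for c in range(k)}
        new = []
        for i in range(n):
            best, best_d = labels[i], float("inf")
            for c, idx in members.items():
                if not idx:  # skip clusters that lost all members
                    continue
                d = (K[i][i]
                     - 2.0 * sum(K[i][j] for j in idx) / len(idx)
                     + sum(K[j][l] for j in idx for l in idx) / len(idx) ** 2)
                if d < best_d:
                    best, best_d = c, d
            new.append(best)
        if new == labels:  # converged
            break
        labels = new
    return labels

pts = [(0, 0), (0.2, 0.1), (0.1, 0.2), (5, 5), (5.1, 5.2), (5.2, 5.0)]
print(kernel_kmeans(pts, 2))  # the two tight groups end up in separate clusters
```

With an RBF kernel the implicit feature space rewards locally dense groups, which is closer to what the eye does than raw centroid distance.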

===============================================

EDIT1: One aspect of the Weka results is still rather awkward, so I used RWeka to run a few experiments. Basically, I ran 100 clustering runs for each initial cluster size between 2 and 7. My expectation was that the results for 2 and 6 would be rather stable, while for the other initial sizes I would have expected the cluster counts to grow.

However, the results turned out differently:

[Figure: resulting cluster counts for each initial cluster size]

So basically 2 and 6 are fairly stable; however, the cluster counts only ever grow by at most 1.

So let's have a look at the BIC:

[Figure: BIC values]

What we can observe is that the BIC does not increase when the cluster size increases; it is, however, strongly dependent on the initial cluster size.
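For reference, here is a rough pure-Python sketch of a BIC score in the spirit of X-means (spherical-Gaussian log-likelihood minus a complexity penalty). Weka's exact formula may differ, and the points and assignments below are invented; the sketch only illustrates the trade-off being scored:

```python
import math

# Hedged sketch of a BIC score for a k-means partition: spherical-Gaussian
# log-likelihood minus a complexity penalty. Not Weka's code; the exact
# parameter count and variance estimate are modelling assumptions.

def bic(points, labels, k):
    n, d = len(points), len(points[0])
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    # pooled maximum-likelihood estimate of a single spherical variance
    ss = 0.0
    for pts in clusters.values():
        centroid = [sum(c) / len(pts) for c in zip(*pts)]
        ss += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in pts)
    sigma2 = max(ss / (n * d), 1e-12)  # guard against zero variance
    loglik = -0.5 * n * d * (math.log(2 * math.pi * sigma2) + 1)
    loglik += sum(len(pts) * math.log(len(pts) / n) for pts in clusters.values())
    params = k * d + (k - 1) + 1  # centroids + mixing weights + shared variance
    return loglik - 0.5 * params * math.log(n)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(bic(pts, [0] * 6, 1))             # one big cluster
print(bic(pts, [0, 0, 0, 1, 1, 1], 2))  # the two obvious clusters score higher
```

On cleanly separated data a correct BIC should favour the split, which is why results like the ones above point at the scoring rather than the search.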

Let's drill down a bit further and look at an initial cluster size of 3. Running multiple restarts with different initial sizes reproducibly generates the following two situations:

[Figure: first recurring 3-cluster solution] [Figure: second recurring 3-cluster solution]

Nevertheless, the BIC results seem to suggest that there is a bug in the BIC calculation.

CAFEBABE
  • Wow! Thanks @CAFEBABE. Do you do weddings? – Sasgorilla Mar 03 '16 at 13:55
  • Also, I'm having trouble letting go of my gestalt theories. Take your first figure, above. Xmeans found six clusters (i.e., colors). Are we saying that the total variance is actually lower assigning those clusters as shown than by assigning them as I would naively? It seems so clear to me that the total variance would be lowered by splitting the red/blue cluster into two and giving each of the (visual) clusters on the right its own center (keeping the number of clusters constant so as not to perturb the BIC) — but I haven't actually tried it. – Sasgorilla Mar 03 '16 at 14:23
  • 1
    think that human perception, and the psychological 'problem' actually tend to be the ideal behavior of these sort of pseudo 'intelligent' solutions. Otherwise, you'd have a clusterer clustering dachshunds with cats after a dimensionality reduction, because of distance rather than presence in an apparent circle/ellipse. A better example might be an object versus a reflection of the object: the psychological 'problem' allows our mind to very easily discriminate between illusions of similarity, and actual similarity. In this case, the distance metric is the illusion. – Chris Aug 15 '17 at 15:20