
In short: I am using k-means clustering with correlation distance. How can I check how many clusters should be used, if any?
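For context, here is a minimal sketch of such a setup (illustrative only; since standard k-means implementations are Euclidean, a common workaround is to z-score each observation, which makes squared Euclidean distance proportional to correlation distance):

```python
# Sketch of k-means with correlation distance via row standardization.
# For row-wise z-scored vectors x, y with n features,
# ||x - y||^2 = 2n * (1 - corr(x, y)), so Euclidean k-means on
# standardized rows approximates correlation-distance k-means.
import numpy as np
from sklearn.cluster import KMeans

def zscore_rows(X):
    """Standardize each observation (row) to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

X = np.random.rand(200, 2000)  # up to a few thousand features, as below
labels = KMeans(n_clusters=5, n_init=10).fit_predict(zscore_rows(X))
```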

There are many indices and answers on how to establish the number of clusters when grouping data: example 1, example 2, etc. For now I am using Dunn's index, but it is not sufficient, due to one of the reasons described below.
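For reference, a minimal sketch of how Dunn's index can be computed with correlation distance (the helper below is my own illustration, using scipy's built-in 'correlation' metric):

```python
# Dunn's index: minimum inter-cluster distance / maximum intra-cluster
# diameter. Note that it needs at least two non-empty clusters -- exactly
# the limitation described below.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels, metric="correlation"):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest within-cluster diameter (singletons have diameter 0, skipped).
    max_diameter = max(pdist(c, metric=metric).max()
                       for c in clusters if len(c) > 1)
    # Smallest distance between any two points from different clusters.
    min_separation = min(
        cdist(a, b, metric=metric).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter
```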

All of those approaches exhibit at least one of the following problems, which I have to avoid:

Indices:

  • the derivation of a clustering quality index makes assumptions about the data covariance matrix, which means that from that point on only Euclidean or Euclidean-like metrics apply, and correlation distance is no longer an option
  • it requires at least two non-empty clusters to compare already-calculated partitions, so there is no way to state whether there is any reason to divide the data into groups at all

Clustering approaches:

  • clustering approaches that estimate the number of clusters themselves (e.g. affinity propagation) are much slower and do not scale well

To sum up: is there any criterion or index that allows checking for the existence of groups in data (and perhaps estimating their number), without restrictions on the metric used?

EDIT: The space I am operating on has up to a few thousand features.

gmrukwa

1 Answer


I have a method, but it is my own invention and rather experimental. While it theoretically works in multiple dimensions, I've only had success in 2D (take the first two principal components if clustering multi-dimensional data).

I call it gravitational clustering. You pass in a smear parameter, then produce an attraction field around each point using 1 / (d + smear)^2 (the smear prevents values going to infinity, and controls the granularity of the clustering). Points then move uphill to their local maximum on the energy field. If they all move to the same point, you have no clusters; if they move to different points, you have clusters; if they all stay at their own local maximum, again you have no clusters.
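A minimal sketch of how this might look in code (my reading of the description above; the step size, tolerances, and the choice to keep the attraction field fixed while the points climb are all assumptions):

```python
# Gravitational clustering sketch: each data point generates an attraction
# field 1 / (d + smear)^2; points climb to local maxima of the total field,
# and points ending at the same maximum form one cluster.
import numpy as np

def field(p, data, smear):
    """Total attraction at position p from all (fixed) data points."""
    d = np.linalg.norm(data - p, axis=1)
    return np.sum(1.0 / (d + smear) ** 2)

def gravitational_clustering(data, smear, step=0.01, n_iter=300, merge_tol=1e-2):
    data = np.asarray(data, dtype=float)
    points = data.copy()
    eps = 1e-6
    for _ in range(n_iter):
        for i in range(len(points)):
            # Numerical gradient ascent: nudge each point uphill.
            grad = np.zeros(points.shape[1])
            for j in range(points.shape[1]):
                e = np.zeros(points.shape[1]); e[j] = eps
                grad[j] = (field(points[i] + e, data, smear)
                           - field(points[i] - e, data, smear)) / (2 * eps)
            points[i] += step * grad / (np.linalg.norm(grad) + 1e-12)
    # Group points that converged to (approximately) the same maximum.
    labels = -np.ones(len(points), dtype=int)
    k = 0
    for i in range(len(points)):
        if labels[i] < 0:
            labels[np.linalg.norm(points - points[i], axis=1) < merge_tol] = k
            k += 1
    return labels
```

With smear close to zero every point stays at its own maximum (as many clusters as unique points); as smear grows, points merge into fewer clusters, which matches the behaviour discussed in the comments below.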

Malcolm McLean
  • For the data we were investigating, it turns out that PCA is too lossy for small nuances inside, while we are operating on up to thousands of features. Either way, I will look at this method. – gmrukwa May 24 '17 at 10:13
  • Sounds like meanshift clustering to me. – Has QUIT--Anony-Mousse May 25 '17 at 08:31
  • It depends on what this d parameter is, doesn't it? – gmrukwa May 25 '17 at 08:43
  • No, d is the distance from any point on the surface to a data point. It depends on smear. If smear is 0, the energy is infinite at every data point, and there are as many clusters as unique data points. As we make smear larger, points merge into clusters. – Malcolm McLean May 25 '17 at 09:36
  • @MalcolmMcLean Now I get it. Still, this requires multiple starts, etc. just to judge whether there is anything to be clustered. Good to know such an approach to clustering, but that's not exactly what I'm looking for. And, as mentioned by Anony-Mousse, the concept is quite similar to mean-shift in this context (checking centroids' trajectories). – gmrukwa May 25 '17 at 09:53