
I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.

My approach is the following:

1. I cluster several times using different numbers of clusters, and calculate the cluster statistics for each of these options.
2. I calculate the cluster statistic metrics using the cluster.stats function from the fpc R package (Cluster.Stats from FPC Cran Package).
3. I plot these metrics and decide, for each metric, which cluster number it favours (see my code below).
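For concreteness, here is a minimal sketch of that loop. The names `hc` (a hierarchical clustering fitted beforehand) and `dist_matrix` (my precomputed dissimilarity, as a `dist` object) are placeholders, not my actual variables:

```r
# Sketch: for each candidate k, cut the tree and collect fpc::cluster.stats()
# output into one data frame (one row per cluster number).
library(fpc)

ks <- 2:80
cs_metrics <- do.call(rbind, lapply(ks, function(k) {
  clustering <- cutree(hc, k = k)                      # cluster labels for this k
  cs <- cluster.stats(d = dist_matrix, clustering = clustering)
  data.frame(cluster.number    = k,
             average.within    = cs$average.within,
             average.between   = cs$average.between,
             avg.silwidth      = cs$avg.silwidth,
             ch                = cs$ch,
             dunn              = cs$dunn,
             dunn2             = cs$dunn2,
             entropy           = cs$entropy,
             pearsongamma      = cs$pearsongamma,
             within.cluster.ss = cs$within.cluster.ss)
}))
```
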

My problem is that these metrics each evaluate a different aspect of clustering "goodness", and the best number of clusters for one metric may not coincide with the best number for a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-cluster sum of squares may indicate that 75 clusters is a better choice.

I understand the basics: distances between points within a cluster should be small, clusters should be well separated from each other, the within-cluster sum of squares should be minimized, and observations in different clusters should be strongly dissimilar. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.

How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?

Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?

Here is how I am deciding the best number of clusters using the statistics: cs_metrics is my dataframe which contains the statistics.

Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]  # minimize avg. within-cluster distance
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]  # maximize avg. between-cluster distance
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]  # maximize average silhouette width
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]  # maximize Calinski-Harabasz index
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]  # maximize Dunn index
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]  # maximize Dunn2 index
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]  # minimize entropy of cluster sizes
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]  # maximize Pearson gamma
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]  # minimize within-cluster sum of squares
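One crude way I could summarize these disagreeing picks is a simple vote tally across metrics, i.e. counting how many metrics favour each cluster number. This is only a heuristic, not a principled selection rule:

```r
# Tally the "best k" chosen by each metric above and sort by vote count.
votes <- c(Average.within.best, Average.between.best, Avg.silwidth.best,
           Calinsky.best, Dunn.best, Dunn2.best, Entropy.best,
           Pearsongamma.best, Within.SS.best)

# Cluster numbers ordered by how many metrics picked them as best
sort(table(votes), decreasing = TRUE)
```
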

Here is the result: Result of best cluster number for each metric

Here are the plots that compare the cluster statistics for the different numbers of clusters (one plot per metric):

  • Average Distance Between and Within Clusters
  • Average Silhouette Width
  • Calinsky Criterion
  • Dunn and Dunn2 index
  • Entropy
  • Pearson Gamma
  • Within Cluster SS

    IMHO don't wait for an algorithm to make the end decision for you. You have many suggestions up there. You know your domain. I say make a judgment call. There is no "right" answer, just good ones and bad ones. You found some good ones. – Pierre L Jun 08 '16 at 12:48
    If you need more suggestions, I find `NbClust` to be extensive. – Pierre L Jun 08 '16 at 12:56
  • I agree with Pierre, but with `NbClust` the first comment generalizes to: don't wait for *many* algorithms to make the end decision for you ;-) – Vincent Bonhomme Jun 08 '16 at 12:58
  • Thanks! I would rather not use a pre-packaged algorithm to make the end decision for me without knowing and understanding what it is doing. This is why I am calculating the metrics, and then trying to decide on an optimal k from these metric comparisons using a more "custom" solution so to speak. I had a look at NbClust: are you suggesting I use it directly the following way: k <- NbClust(data = data, diss = dist_matrix, distance = NULL, min.nc = 2, max.nc = 80, method = "ward.D") ...or rather: that I try to implement myself some of the statistics that are used in the NbClust package? –  Jun 08 '16 at 14:50
  • ... continuation: If I understand correctly, NbClust seems to use a long list of indexes to determine the optimal k, which are listed on page 18 of the manual: http://cedric.cnam.fr/fichiers/art_2554.pdf Most of these I am not familiar with (I am new to clustering). I want to avoid randomly choosing a validation metric. –  Jun 08 '16 at 15:24
  • Clustering is an *explorative* method. Do *not* rely on such a measure. Every clustering is wrong. Some may be more interesting than others *if* you study them, but there is no measure of interestingness. **You must not take the human out of the loop.** – Has QUIT--Anony-Mousse Jun 08 '16 at 17:10
  • Thanks Anony - indeed, I do not want to take the human out of the loop, but how do I do this when dealing with massive big data, i.e. millions of timeseries? I understand that the goodness of the clustering is highly dependent on the nature of the data and the end goal, which is to match these timeseries by common patterns or partial patterns. At the moment I am using Fréchet distance as a measure of timeseries similarity, and hierarchical clustering on that distance matrix. –  Jun 09 '16 at 08:08

0 Answers