
I use the following tsclust statement to cluster my data:

SURFSKINTEMP_CLUST <- tsclust(SURFSKINTEMP, k = 10L:20L,
                       distance = "dtw_basic", centroid = "dba",
                       trace = TRUE, seed = 938,
                       norm = "L2", window.size = 2L,
                       args = tsclust_args(cent = list(trace = TRUE)))

SURFSKINTEMP is very big:

str(SURFSKINTEMP)
List of 327239
 $ V1     : num [1:7] 0.13 0.631 -0.178 0.731 0.86 ...
 $ V2     : num [1:6] 0.117 -0.693 -0.911 -0.911 -0.781 ...
 $ V3     : num [1:7] 0.117 -0.693 -0.911 -0.911 -0.781 ...
 $ V4     : num [1:6] -0.693 -0.911 -0.911 -0.781 -0.604 ...

Then I want to use cvi to evaluate the optimum number of clusters "k":

names(SURFSKINTEMP_CLUST) <- paste0("k_",10L:20L)
sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")

But there is an error:

> sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")
Error: cannot allocate vector of size 797.8 Gb

How can I evaluate the optimum number of clusters “k” in my case?

  • Run cvi against a sample of the data set, say 10,000 series. Do this a number of times to check for stability. If the result varies across samples, bootstrap (1000 replications or so) and take the average. – justin cress Nov 29 '17 at 13:50
  • @justin cress can you show me some code for how to do this? – Pan Nov 29 '17 at 13:51
  • @pan something like `surfSkinSample <- SURFSKINTEMP[sample(seq_along(SURFSKINTEMP), 1e4)]`. Run your analysis on surfSkinSample. Save the results. Then repeat this 6 or 7 times to see if the same number of clusters is consistently the best. If you are getting mixed results, then perform a bootstrap of this process, taking the average of the best number of clusters as your best result. – lmo Nov 29 '17 at 14:02
  • @lmo how do I perform a bootstrap in my case? – Pan Nov 29 '17 at 14:39
  • @pan does the answer give you what you're looking for? – justin cress Nov 29 '17 at 15:04

2 Answers


The error message indicates you're trying to churn more data than your available resources will support. In cases like this, run the analysis on a smaller sample, and repeat it a number of times:

reps <- 1000
samp_size <- 10000
result <- vector("list", reps)
for (j in 1:reps) {
    # draw a random subset of the series (SURFSKINTEMP is a list, so index it directly)
    surf_sample <- SURFSKINTEMP[sample(seq_along(SURFSKINTEMP), samp_size)]

    # cluster the sample, not the full data set
    sample_clust <- tsclust(surf_sample, k = 10L:20L,
                            distance = "dtw_basic", centroid = "dba",
                            trace = TRUE, seed = 938,
                            norm = "L2", window.size = 2L,
                            args = tsclust_args(cent = list(trace = TRUE)))

    # one matrix of CVIs (index x k) per replication
    result[[j]] <- sapply(sample_clust, cvi, type = "internal")
}

This produces a list of CVI results (one matrix per replication) that you can inspect.
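If you want a single summary, one option (a sketch, assuming result was filled by the loop above and that the rows returned by cvi with type = "internal" include "Sil" for the Silhouette index) is to record which k wins in each replication and tabulate:

best_k <- sapply(result, function(m) (10:20)[which.max(m["Sil", ])])
table(best_k)  # how often each candidate k is chosen across replications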

justin cress
  • Without experience in tsclust, not sure how to get results from the output of CVI, but I assume you have this handled. If you're uncomfortable with the amount of variation between samples, increase the sample size until memory gives out. – justin cress Nov 29 '17 at 14:11

Specifying type = "internal" will try to calculate 7 indices: Silhouette, Dunn, COP, DB, DB*, CH and SF. As mentioned in the documentation for cvi, the first 3 will try to calculate the whole cross-distance matrix, which in your case would be a 327,239 x 327,239 matrix; you're going to have a hard time finding a computer that can allocate that, and it would take a long time to compute.
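As a quick sanity check on the number in the error message (just back-of-the-envelope arithmetic, not dtwclust output):

n <- 327239          # number of series in SURFSKINTEMP
n^2 * 8 / 1024^3     # n x n matrix of doubles (8 bytes each): roughly the 797.8 Gb reported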

Since you're using DBA for centroids, you could see if DB or DB* make sense for your application:

sapply(SURFSKINTEMP_CLUST, cvi, type = c("DB", "DBstar"))
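Both of those indices are to be minimized, so (a sketch, assuming SURFSKINTEMP_CLUST still carries the "k_10" ... "k_20" names set earlier) you could pick the winner per index with something like:

db_cvis <- sapply(SURFSKINTEMP_CLUST, cvi, type = c("DB", "DBstar"))
apply(db_cvis, 1L, function(row) names(which.min(row)))  # lower is better for both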

You could also look at the (somewhat simplistic) elbow method, bearing in mind that you can calculate the sum of squared error (SSE) with (see the documentation for TSClusters-class):

sapply(SURFSKINTEMP_CLUST, function(cl) { sum(cl@cldist ^ 2) })
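To actually look for the elbow, you could plot those SSE values against k (a minimal sketch using base graphics):

sse <- sapply(SURFSKINTEMP_CLUST, function(cl) { sum(cl@cldist ^ 2) })
plot(10:20, sse, type = "b", xlab = "k", ylab = "SSE")
# the "elbow" is where adding more clusters stops reducing the SSE appreciably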
Alexis