
My data set consists of measurements of the same index over 14 years (columns) for 105 countries (rows). I want to cluster the countries based on their index trend over time.

I am trying hierarchical clustering (hclust) and k-medoids (pam), both exploiting a DTW distance matrix (dtw package).
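
For concreteness, here is a minimal sketch of that setup (the matrix name idx, k = 4, and the average linkage are my placeholders; the data is assumed to be a 105 x 14 numeric matrix with one row per country):

library(dtw)      # loading dtw registers the "DTW" distance with the proxy package
library(cluster)  # pam() for k-medoids

# pairwise DTW distances between the country rows (a dist object)
d <- proxy::dist(idx, method = "DTW")

hc <- hclust(as.dist(d), method = "average")  # hierarchical clustering on DTW
clusters_hc <- cutree(hc, k = 4)              # cut the dendrogram into 4 groups

pm <- pam(as.dist(d), k = 4, diss = TRUE)     # k-medoids on the same distances
clusters_pam <- pm$clustering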

I also tried k-means, passing the DTW distance matrix as the first argument of the kmeans function. The algorithm runs, but I'm not sure the results are accurate, since k-means exploits Euclidean distance and computes centroids as means.
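
For reference, the call looked roughly like this (reusing the DTW matrix d from the sketch above; note that kmeans() has no dissimilarity argument and treats its first argument as an ordinary data matrix):

km <- kmeans(as.matrix(d), centers = 4)  # runs, but clusters the rows of the
                                         # distance matrix as points, not the
                                         # original series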

I am also thinking about using the data directly, but I can't see how the result would be accurate: the algorithm would treat the different measurements of the same variable over time as different variables when computing the centroids at each iteration, and would use Euclidean distance to assign observations to clusters. It doesn't seem to me that this process could cluster time series as well as hierarchical clustering and k-medoids do.

Is the k-means algorithm a good choice for clustering time series, or is it better to use algorithms that exploit a distance concept such as DTW (but are slower)? Is there an R function that allows the k-means algorithm to be used with a distance matrix, or a specific package for clustering time series data?

Miriam_G
  • This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you have general questions about the appropriate use of various statistical methods, then you should ask them over at [stats.se] instead; you are more likely to get better answers there. Just asking whether a function exists is better suited to Google. Questions here should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output to test possible solutions. – MrFlick Mar 02 '20 at 20:35
  • This is an interesting and important question, however, for better or worse the answer can only be known by trying it both ways and seeing what works better. A question about what R package to use will get more attention on stats.stackexchange.com or the r-help mailing list. – Robert Dodier Mar 04 '20 at 00:38

2 Answers


KMeans will do exactly what you tell it to do. Unfortunately, feeding a time series dataset into a KMeans algo will give you meaningless results. The KMeans algo, like most general clustering methods, is built around the Euclidean distance, which does not seem to be a good measure for time series data. Quite simply, k-means often fails when clusters are not round-shaped, because it measures distance from the cluster center. Check out the GMM algo as an alternative. It sounds like you are going with R for this experiment. If so, check out the sample code below.

Here is a KMeans cluster.

[figure: KMeans clustering result]

Here is a GMM cluster.

[figure: GMM clustering result]

Which one looks more like a time series plot to you?

I Googled around for a good sample of R code to demonstrate how GMM clustering works, but unfortunately I couldn't find anything decent. Personally, I use Python much more than R. If you are open to a Python solution, check out the sample code below.

import numpy as np
import itertools

from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn import mixture

# Number of samples per component
n_samples = 500

# Generate random sample, two components
np.random.seed(0)
C = np.array([[0., -0.1], [1.7, .4]])
X = np.r_[np.dot(np.random.randn(n_samples, 2), C),
          .7 * np.random.randn(n_samples, 2) + np.array([-6, 3])]

lowest_bic = np.inf
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bic.append(gmm.bic(X))
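        # keep the model with the lowest BIC seen so far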
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm

bic = np.array(bic)
color_iter = itertools.cycle(['navy', 'turquoise', 'cornflowerblue',
                              'darkorange'])
clf = best_gmm
bars = []

# Plot the BIC scores
plt.figure(figsize=(8, 6))
spl = plt.subplot(2, 1, 1)
for i, (cv_type, color) in enumerate(zip(cv_types, color_iter)):
    xpos = np.array(n_components_range) + .2 * (i - 2)
    bars.append(plt.bar(xpos, bic[i * len(n_components_range):
                                  (i + 1) * len(n_components_range)],
                        width=.2, color=color))
plt.xticks(n_components_range)
plt.ylim([bic.min() * 1.01 - .01 * bic.max(), bic.max()])
plt.title('BIC score per model')
xpos = np.mod(bic.argmin(), len(n_components_range)) + .65 +\
    .2 * np.floor(bic.argmin() / len(n_components_range))
plt.text(xpos, bic.min() * 0.97 + .03 * bic.max(), '*', fontsize=14)
spl.set_xlabel('Number of components')
spl.legend([b[0] for b in bars], cv_types)

# Plot the winner
splot = plt.subplot(2, 1, 2)
Y_ = clf.predict(X)
for i, (mean, cov, color) in enumerate(zip(clf.means_, clf.covariances_,
                                           color_iter)):
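    # eigh gives eigenvalues v (ellipse axis lengths) and eigenvectors w (orientation)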
    v, w = linalg.eigh(cov)
    if not np.any(Y_ == i):
        continue
    plt.scatter(X[Y_ == i, 0], X[Y_ == i, 1], .8, color=color)

    # Plot an ellipse to show the Gaussian component
    angle = np.arctan2(w[0][1], w[0][0])
    angle = 180. * angle / np.pi  # convert to degrees
    v = 2. * np.sqrt(2.) * np.sqrt(v)
    ell = mpl.patches.Ellipse(mean, v[0], v[1], angle=180. + angle, color=color)
    ell.set_clip_box(splot.bbox)
    ell.set_alpha(.5)
    splot.add_artist(ell)

plt.xticks(())
plt.yticks(())
plt.title('Selected GMM: full model, 2 components')
plt.subplots_adjust(hspace=.35, bottom=.02)
plt.show()

[figure: BIC score per model and the selected GMM ellipses]

Finally, from the image above, you can clearly see how the selected GMM separates the data into components.

ASH
  • Thank you very much for your answer, it is now clearer why K Mean is not a good option in this setting. I obtained good results with Hierarchical clustering and K Medoids since they allow the use of DTW distance matrix. I will try also GMM. – Miriam_G Mar 03 '20 at 22:27
  • Hierarchical Clustering is great!! Affinity Propagation is great too!! I haven't experimented with K Medoids yet, so I can't speak on that. – ASH Mar 03 '20 at 22:29
  • @ASH I dunno, maybe K means works for time series and maybe it doesn't, but your criticisms are off the mark. The objects which are being clustered in clustering approaches for time series are segments of the series which are treated as vectors in a n-dimensional space where n is the length of the segments. All the fun is in how distance is computed in that space, which leads to time warping and so on. It's not the case that one is clustering points in the 2-dimensional (x, y) space, so it's beside the point to say that it's incorrect. – Robert Dodier Mar 04 '20 at 00:35

Here's an example of how to visualize clusters using the plotGMM package. The code to reproduce it follows:

require(quantmod)
require(ggplot2)   # for fortify()
require(mixtools)  # for normalmixEM()
require(plotGMM)   # for plot_GMM()

SCHB <- fortify(getSymbols('SCHB', auto.assign = FALSE))  # daily prices as a data frame
set.seed(730)  # for reproducibility
mixmdl <- mixtools::normalmixEM(Cl(SCHB), k = 5)  # fit a 5-component Gaussian mixture to the closes
plot_GMM(mixmdl, k = 5)

Oh, and for plotting time series with ggplot2, you should avail yourself of ggplot2's fortify function; a minimal sketch follows. I hope that helps.
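
For instance (an assumption on my part: this reuses the fortified SCHB data frame from above, where fortify() names the date column Index and quantmod names the close price column SCHB.Close):

require(ggplot2)

# line plot of the closing price over time
ggplot(SCHB, aes(x = Index, y = SCHB.Close)) +
  geom_line()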

hd1