
I have a simple 2-dimensional dataset that I wish to cluster in an agglomerative manner (not knowing the optimal number of clusters to use). The only way I've been able to cluster my data successfully is by giving the function a 'maxclust' value.

For simplicity's sake, let's say this is my dataset:

X=[ 1,1;
    1,2;
    2,2;
    2,1;
    5,4;
    5,5;
    6,5;
    6,4 ];

Naturally, I would want this data to form 2 clusters. I understand that if I knew this, I could just say:

T = clusterdata(X,'maxclust',2);

and to find which points fall into each cluster I could say:

cluster_1 = X(T==1, :);

and

cluster_2 = X(T==2, :);

but without knowing that 2 clusters would be optimal for this dataset, how do I cluster these data?

Thanks

  • similar question: [What stop-criteria for agglomerative hierarchical clustering are used in practice?](http://stats.stackexchange.com/q/2597) – Amro Nov 05 '11 at 01:40

2 Answers


The whole point of this method is that it represents the clusters it finds in a hierarchy, and it is up to you to decide how much detail you want.

[figure: agglomerative dendrogram]

Think of a horizontal line intersecting the dendrogram, sweeping from 0 (where each point is its own cluster) up to the maximum height (where all points form one cluster). You could:

  • stop when you reach a predetermined number of clusters (example)
  • manually position it given a certain height value (example)
  • choose to place it where the clusters are too far apart according to the distance criterion (i.e., there is a big jump to the next merge level) (example)

This can be done using either the 'maxclust' or the 'cutoff' argument of the CLUSTER/CLUSTERDATA functions.
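For concreteness, here is a minimal sketch of all three options applied to the dataset from the question. The single-linkage method and the threshold values (2 for the distance cut, 1.0 for the inconsistency cut) are assumptions I chose for this toy data, not values prescribed above:

Z = linkage(pdist(X), 'single');   % build the hierarchy once
dendrogram(Z)                      % inspect the merge heights visually

% 1) stop at a predetermined number of clusters
T1 = cluster(Z, 'maxclust', 2);

% 2) cut at a given height (distance between merged clusters)
T2 = cluster(Z, 'cutoff', 2, 'criterion', 'distance');

% 3) cut where there is a big jump, via the inconsistency coefficient
%    (the default 'cutoff' criterion); 1.0 is an assumed threshold
T3 = cluster(Z, 'cutoff', 1.0);

For this toy data, all three cuts recover the same two clusters.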

  • Great explanation, Amro. I noticed this is a topic you seem to have extensive experience with. Those links were all very useful to my application. Thanks! – Kevin_TA Nov 07 '11 at 22:30

To choose the optimal number of clusters, one common approach is to make a plot similar to a scree plot, then look for the "elbow" in it; the number of clusters at the elbow is the one you pick. For the criterion here, we will use the within-cluster sum-of-squares:

function wss = plotScree(X, n)
% Plot the within-cluster sum-of-squares (WSS) for 1 to n clusters.
wss = zeros(1, n);
% one cluster: WSS is the total sum of squares, (N-1) * variance, summed over columns
wss(1) = (size(X, 1)-1) * sum(var(X, [], 1));
for i = 2:n
    T = clusterdata(X, 'maxclust', i);
    % per cluster: (size-1) times the per-column variance, summed over columns and clusters
    wss(i) = sum((grpstats(T, T, 'numel')-1) .* sum(grpstats(X, T, 'var'), 2));
end
hold on
plot(wss)
plot(wss, '.')
xlabel('Number of clusters')
ylabel('Within-cluster sum-of-squares')
>> plotScree(X, 5)

ans =

   54.0000    4.0000    3.3333    2.5000    2.0000

[figure: scree plot of within-cluster sum-of-squares vs. number of clusters, with the elbow at 2]
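Reading the elbow off these numbers: the WSS drops from 54 to 4 going from one cluster to two, and barely changes after that, so you would pick 2 clusters. If you want to automate that reading, one simple heuristic (my sketch, not part of this answer) is to pick the k with the largest relative drop in WSS:

wss = plotScree(X, 5);
relDrop = -diff(wss) ./ wss(1:end-1);   % fractional WSS decrease from k to k+1
[~, i] = max(relDrop);
k = i + 1                               % the largest relative drop ends at k = 2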

  • the within-cluster sum of squares might not always be a good criterion, especially since hierarchical clustering defaults to the single-linkage method, which, by definition, only seeks separation (between clusters) and doesn't care about compactness or balance (within clusters) – Amro Nov 05 '11 at 01:50
  • Thanks for this code. It was particularly useful when combined with the method used in the third link shared by Amro. – Kevin_TA Nov 07 '11 at 22:29