When to stop agglomerative hierarchical clustering - stopping criteria

Question

I am coding every function of my application myself, so I am not using tools that do everything for you.

I have been looking for a way to decide where to cut my agglomerative hierarchical clustering.

How do I cluster?

I have coded the application in C# 4.5.2.

So far I am using standard hierarchical clustering, which uses Euclidean distance to calculate the distance between document pairs.

It also uses UPGMA to calculate the distance between clusters and decide which ones to merge.

I also coded the Rand index and F-measure to test how well it does on my manually labeled data set.

However, the problem is when to stop merging clusters.

I am really bad at understanding mathematical equations without a real data example or well-explained pseudocode.

There are mathematical equations everywhere but no real-life examples.

So I am looking for your answers. For example, many places say that the Bayesian information criterion (BIC) is a good solution, but I can't figure out how to apply it in my software.

I also have other distance or similarity metrics, such as cosine similarity, Sørensen-Dice distance, etc.

There are many questions on Stack Exchange and Stack Overflow about this, but all the answers use tools like MATLAB or R.

score 2 · Answer 1 · edited May 23 '17 at 11:45


Try to compute some measure of how well each particular clustering fits, for example the sum of distances from the cluster centres, or the sum of squared errors. You should find that this error decreases as you increase the number of clusters (it is easier to fit the data with more clusters) and increases as you decrease the number of clusters.
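
As a minimal sketch of the sum-of-squared-errors measure (written in Python rather than C#, and assuming each document has already been turned into a numeric vector), something like the following could be called after every merge. The names `centroid` and `sse` are made up for this illustration.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centre.

    `clusters` is a list of clusters; each cluster is a list of vectors.
    """
    total = 0.0
    for cluster in clusters:
        centre = centroid(cluster)
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centre))
    return total


# Toy data (invented): two clusters versus everything merged into one.
two_clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0)]]
one_cluster = [[(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]]
print(sse(two_clusters), sse(one_cluster))
```

Recording this value for every number of clusters, from n singletons down to one cluster, gives the error curve that the rest of this answer refers to.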

Now draw a graph and look for an "elbow" where the error starts to get large more quickly as the number of clusters decreases. You could then assume that the minimum number of clusters before the error starts increasing very rapidly is the true number of clusters in the data.
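
The elbow does not have to be read off a plot by eye. One common heuristic (an assumption here, not the only way to do it) is to draw a straight line from the first to the last point of the error curve and pick the number of clusters where the curve falls furthest below that line. A rough Python sketch, with the hypothetical name `find_elbow` and invented error values:

```python
def find_elbow(errors):
    """Pick the elbow of a decreasing error curve.

    errors[i] is the error of the clustering with i + 1 clusters.
    Returns the number of clusters at which the curve lies furthest
    below the straight line joining its first and last points.
    """
    n = len(errors)
    if n < 3:
        return n  # too few points for a meaningful elbow
    first, last = errors[0], errors[-1]
    best_k, best_gap = 1, float("-inf")
    for i, err in enumerate(errors):
        k = i + 1
        # Height of the straight line at k, minus the actual error at k.
        line = first + (last - first) * (k - 1) / (n - 1)
        gap = line - err
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Invented error values for 1..8 clusters.
errors = [100.0, 60.0, 30.0, 8.0, 6.0, 5.0, 4.0, 3.0]
print(find_elbow(errors))
```

With these numbers the function returns 4, the point after which adding more clusters barely reduces the error. On real data the curve is often less clean, so the result is better treated as a suggestion than as the true number of clusters.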

See for example the graph in "Cluster analysis in R: determine the optimal number of clusters", just below the text "We might conclude that 4 clusters would be indicated by this method:".


answered Sep 05 '15 at 04:34

mcdowella


Thanks for the answer. However, drawing a graph means a supervised technique; I have to do it programmatically. Also, in my application I have no idea how I could draw a graph of it :D – Furkan Gözükara Sep 05 '15 at 11:37

The article https://www.stat.washington.edu/wxs/Stat592-w2011/Literature/tibshirani-walther-prediction-strength-2005.pdf describes a way to score clusterings with different numbers of clusters using cross validation. A quick read suggests that it divides up the data to cluster things repeatedly and looks to see if pairs of points are reliably clustered together, or reliably not clustered together. The hope is that if you get the number of clusters right, the clusterings you form with this number of clusters will have this property. – mcdowella Sep 07 '15 at 04:40
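
A simplified Python sketch of that idea, as I read it, might look like the following. It assumes the data has already been split into two halves (`train` and `test`), uses a naive centroid-linkage clusterer as a stand-in for UPGMA, and assigns test points to the nearest training centroid. All function names are invented for this illustration.

```python
import itertools


def centroid(points):
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(points, k):
    """Naive agglomerative clustering: merge the two clusters with the
    closest centroids until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                   centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def prediction_strength(train, test, k):
    """For each test cluster, the fraction of its point pairs that the
    training centroids also place together; returns the worst fraction."""
    centres = [centroid(c) for c in agglomerate(train, k)]

    def predicted(p):
        return min(range(len(centres)), key=lambda c: sq_dist(p, centres[c]))

    worst = 1.0
    for cluster in agglomerate(test, k):
        pairs = list(itertools.combinations(cluster, 2))
        if not pairs:
            continue  # a singleton cluster has no pairs to check
        agree = sum(predicted(a) == predicted(b) for a, b in pairs)
        worst = min(worst, agree / len(pairs))
    return worst


# Invented toy data: two well-separated blobs, split into two halves.
train = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
test = [(0.5, 0.5), (1, 1), (0, 0.5), (10.5, 10.5), (11, 11), (10, 10.5)]
print(prediction_strength(train, test, 2))
```

The stopping rule would then be to try several values of k, ideally averaging over several random splits, and keep the largest k whose score stays above a chosen threshold. I believe the article suggests a threshold of roughly 0.8 to 0.9, but check the paper before relying on that.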
