0

Which is the best method for document classification if time is not a factor and we don't know how many classes there are?

Evan

2 Answers

2

To my (incomplete) knowledge, Hierarchical Agglomerative Clustering (HAC) is the best approach if you don't know how many classes there are. Most other clustering algorithms either require prior knowledge of the number of buckets or need some sort of cross-validation or other experimentation to determine a sensible number of buckets.
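
To make the idea concrete, here is a minimal pure-Python sketch of agglomerative clustering that stops merging at a distance threshold instead of at a fixed cluster count. The bag-of-words vectors, cosine distance, average linkage, the function names, and the threshold value are all my assumptions for illustration, not something the question specifies. Note that you still have to pick a threshold, so the tuning moves from "how many clusters" to "how dissimilar is too dissimilar".

```python
import math
from collections import Counter

def tf_vector(text):
    """Term counts for one document (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def hac(docs, threshold):
    """Average-linkage HAC; returns clusters as lists of document indices.

    Merging stops once the two closest clusters are farther apart than
    `threshold` (a hypothetical parameter chosen for this sketch).
    """
    vecs = [tf_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per doc

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(cosine_distance(vecs[i], vecs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # nothing left that is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
print(hac(docs, threshold=0.6))  # [[0, 1], [2, 3]]
```

This brute-force version recomputes every pairwise linkage on each pass, which is fine when time is not a factor but scales poorly; a library implementation would be the practical choice for a real corpus.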

bmargulies
  • +1. With flat clustering though, sqrt(N) for N items is sometimes recommended as the number of clusters. – Fred Foo Apr 11 '11 at 21:03
  • How about doing the actual clustering with the Growing SOM algorithm and then applying the bottom-up approach of HAC to the result? That way we don't have to guess the number of clusters either. – Evan Apr 11 '11 at 21:07
  • I have no background in Growing SOM, so I can't advise you either way about that. – bmargulies Apr 11 '11 at 21:17
  • It's just a Self-Organizing Map that grows its resolution (number of clusters) wherever the mean error is above a threshold. It is flat clustering, but it preserves the topology, so the hierarchy can be found by comparing the differences between neighboring clusters. – Evan Apr 11 '11 at 21:23
1

A cross link: see how-do-i-determine-k-when-using-k-means-clustering on SO.

Community
denis
  • Thanks! I have already started with Growing SOM, though. This will be helpful for determining the starting grid size. – Evan Apr 13 '11 at 19:06