Text classification, preprocessing included

Question

Which is the best method for document classification if time is not a factor and we don't know how many classes there are?

We don't know how many classes there are, so the plan is clustering followed by class labeling. – Evan, Apr 11 '11 at 21:00

bmargulies · Accepted Answer · 2011-04-11T21:16:33.553

2

As far as my (incomplete) knowledge goes, hierarchical agglomerative clustering (HAC) is the best approach if you don't know how many classes there are. The other clustering algorithms I know of either require prior knowledge of the number of buckets, or need some sort of cross-validation or other experimentation to determine a sensible number of buckets.
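
To make the idea concrete, here is a minimal sketch of agglomerative clustering that never takes a cluster count as input. It is only an illustration under several assumptions that are not part of the answer: documents are represented as raw bag-of-words counts, distance is cosine distance, linkage is average linkage, and merging stops at a hypothetical cutoff `threshold=0.7`. The function names (`tf_vector`, `cosine_distance`, `hac`) are made up for this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def hac(docs, threshold=0.7):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per document and repeatedly merges the two
    closest clusters. Merging stops once the closest pair is farther apart
    than `threshold` (an assumed cutoff), so no cluster count is needed.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [tf_vector(d) for d in docs]
    n = len(docs)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
clusters = hac(docs)
print(clusters)
```

The cutoff replaces the cluster count as the tuning knob, so it still has to be chosen; inspecting the merge distances (the dendrogram) is the usual way to pick it. This naive version rescans all cluster pairs on every merge, so it only suits small collections.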

edited Apr 11 '11 at 21:16

answered Apr 11 '11 at 21:01

bmargulies


+1. With flat clustering though, sqrt(N) for N items is sometimes recommended as the number of clusters. – Fred Foo Apr 11 '11 at 21:03
How about doing the actual clustering with the Growing SOM algorithm and then applying the bottom-up approach of HAC? That way we don't have to guess the number of clusters either. – Evan Apr 11 '11 at 21:07
I have no background on Growing SOM, so I can't advise you either way about that. – bmargulies Apr 11 '11 at 21:17
It's just a Self-Organizing Map that grows its resolution (number of clusters) wherever the mean error is above a threshold. It is flat clustering, but it preserves the topology, so the hierarchy can be found by comparing the differences between neighboring clusters. – Evan Apr 11 '11 at 21:23
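
For reference, the sqrt(N) rule of thumb mentioned in the first comment above is simple to compute. This is only a sketch of that heuristic as stated there; the function name `sqrt_rule_k` and the rounding choice are assumptions made for this example, and the result is a starting guess for flat clustering, not a principled estimate.

```python
import math


def sqrt_rule_k(n_items):
    """Rule-of-thumb cluster count for flat clustering: about sqrt(N), at least 1."""
    return max(1, round(math.sqrt(n_items)))


# Example: 10,000 documents suggest roughly 100 clusters as a first try.
k = sqrt_rule_k(10_000)
print(k)
```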

score 1 · Answer 2 · edited May 23 '17 at 11:55


A cross link: see how-do-i-determine-k-when-using-k-means-clustering on SO.

edited May 23 '17 at 11:55

Community


answered Apr 13 '11 at 13:54

denis


Thanks! I have already started with Growing SOM, though. This will be helpful for determining the starting grid size. – Evan Apr 13 '11 at 19:06
