I am trying to find the optimal number of clusters 'k' for clustering a set of documents with the k-means algorithm. I wanted to use the AIC and BIC information criteria to pick the best model, following this resource: "sherrytowers.com/2013/10/24/k-means-clustering/".

But I got the following graphs for AIC and BIC when I ran the code, and I am unable to interpret anything from them. My questions are:

  1. Is my approach wrong, i.e. can these measures (AIC, BIC) not be used for document clustering with k-means?
  2. Or are AIC and BIC the right way to find the number of clusters 'k', and the error is in my programming logic?

Here's my code:

library(tm)
library(SnowballC)
# Build the corpus from the document set
corp <- Corpus(DirSource("/home/dataset/"), readerControl = list(blank.lines.skip=TRUE))
corp <- tm_map(corp, stemDocument, language="english")
# Build the document-term matrix; wordLengths = c(1, Inf) keeps terms of any length
dtm <- DocumentTermMatrix(corp, control=list(wordLengths = c(1, Inf)))
dtm_tfidf <- weightTfIdf(dtm)
m <- as.matrix(dtm_tfidf)
# Scale every row (document) to unit Euclidean length
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)

kmax = 50

totwss = rep(0,kmax) # will be filled with the total within-cluster sum of squares for each k
kmfit = list() # create an empty list to hold the fitted models
for (i in 1:kmax){
  kclus = kmeans(m_norm,centers=i,iter.max=20)
  totwss[i] = kclus$tot.withinss
  kmfit[[i]] = kclus
}

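kmeansAIC = function(fit){
  m = ncol(fit$centers)   # number of dimensions (terms)
  k = nrow(fit$centers)   # number of clusters
  D = fit$tot.withinss    # total within-cluster sum of squares
  return(D + 2*m*k)       # penalty of 2 per estimated centre coordinate
}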
aic=sapply(kmfit,kmeansAIC)
plot(seq(1,kmax),aic,xlab="Number of clusters",ylab="AIC",pch=20,cex=2)


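kmeansBIC = function(fit){
  m = ncol(fit$centers)    # number of dimensions (terms)
  n = length(fit$cluster)  # number of documents
  k = nrow(fit$centers)    # number of clusters
  D = fit$tot.withinss     # total within-cluster sum of squares
  return(D + log(n)*m*k)   # penalty of log(n) per estimated centre coordinate
}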
bic=sapply(kmfit,kmeansBIC)
plot(seq(1,kmax),bic,xlab="Number of clusters",ylab="BIC",pch=20,cex=2)

These are the graphs it generated: http://snag.gy/oAfhk.jpg and http://snag.gy/vT8fZ.jpg
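
One way to separate the two questions is to run the same AIC and BIC formulas on data whose structure is known in advance. The following is only a minimal sketch under stated assumptions: the data are synthetic (60 points in 2 dimensions around 3 hypothetical centres), not the document matrix, and the names `centres`, `x`, `fits`, `ic` and `x_unit` are made up for illustration.

```r
# Synthetic stand-in for the data (assumption: 3 well-separated groups of
# 20 points each in 2 dimensions; this is NOT the original corpus).
set.seed(42)
centres <- matrix(c(0, 0,
                    5, 5,
                    0, 5), ncol = 2, byrow = TRUE)
x <- do.call(rbind, lapply(1:3, function(j)
  cbind(rnorm(20, centres[j, 1], 0.5), rnorm(20, centres[j, 2], 0.5))))

# Fit k-means for k = 1..kmax, using several random starts per k
kmax <- 8
fits <- lapply(1:kmax, function(k)
  kmeans(x, centers = k, nstart = 10, iter.max = 20))

# Same criteria as in the question: D + penalty * m * k,
# with penalty = 2 for AIC and penalty = log(n) for BIC
ic <- function(fit, penalty) {
  m <- ncol(fit$centers)  # number of dimensions
  k <- nrow(fit$centers)  # number of clusters
  fit$tot.withinss + penalty * m * k
}
n   <- nrow(x)
aic <- sapply(fits, ic, penalty = 2)
bic <- sapply(fits, ic, penalty = log(n))

# Inspect the fit term and both criteria side by side
print(data.frame(k = 1:kmax,
                 D = sapply(fits, function(f) f$tot.withinss),
                 AIC = aic, BIC = bic))
print(c(best_k_aic = which.min(aic), best_k_bic = which.min(bic)))

# Rows scaled to unit length, as norm_eucl() does in the question: the total
# within-cluster sum of squares can then never exceed the number of rows.
x_unit <- x / sqrt(rowSums(x^2))
d_unit <- kmeans(x_unit, centers = 1)$tot.withinss
print(c(D_unit = d_unit, n = n))
```

If the criteria behave sensibly on such a toy set, the formulas themselves are probably implemented as intended. The last lines illustrate a property that may matter for the document case: with unit-length rows, D stays at or below n for every k, while the penalty terms 2*m*k and log(n)*m*k grow with the vocabulary size m, so it may be worth comparing the two magnitudes before reading the plots.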
