I've been using k-means to cluster my data in R, but I'd like to assess the fit vs. model complexity of my clustering using the Bayesian Information Criterion (BIC) and AIC. Currently the code I've been using in R is:

KClData <- kmeans(Data, centers = 2, nstart = 100)

But I'd like to be able to extract the BIC and log-likelihood. Any help would be greatly appreciated!

UnivStudent
  • Function `Mclust` in package mclust might be of interest. – Roland Apr 05 '13 at 17:45
  • Roland, thanks for the tip! I'm actually trying to compare the results of k-means to Mclust's output, which is why I'd like to compare the BIC from my k-means clustering to the BIC from the GMM that Mclust fits. – UnivStudent Apr 05 '13 at 17:50
  • I am not an expert, but I think that k-means is not a maximum-likelihood algorithm. Are you sure that AIC and BIC are applicable? – Roland Apr 05 '13 at 17:58
  • It does have a log-likelihood associated with it, but I'm having trouble finding it and implementing it in R. – UnivStudent Apr 06 '13 at 19:53
  • See a similar question on the statistics community: http://stats.stackexchange.com/q/55147/3277 – ttnphns Jun 30 '16 at 09:34

5 Answers

For anyone else landing here, there's a method proposed by Sherry Towers at http://sherrytowers.com/2013/10/24/k-means-clustering/, which uses output from stats::kmeans. I quote:

The AIC can be calculated with the following function:

kmeansAIC = function(fit){
  m = ncol(fit$centers)    # number of dimensions
  n = length(fit$cluster)  # number of data points
  k = nrow(fit$centers)    # number of clusters
  D = fit$tot.withinss     # total within-cluster sum of squares
  return(D + 2*m*k)
}

From the help for stats::AIC, you can also see that the BIC can be calculated in a similar way to the AIC. An easy way to get the BIC is to replace the return() in the above function with this:

return(data.frame(AIC = D + 2*m*k,
                  BIC = D + log(n)*m*k))

So you would use this as follows:

fit <- kmeans(x = data, centers = 6)
kmeansAIC(fit)
Andy Clifton
  • I have used your method, but the BIC of the k-means results decreases monotonically as the number of clusters increases. Please see this post: http://stats.stackexchange.com/questions/55147/k-means-bic-to-validate-clusters-in-r/183097#183097 – pengchy Nov 24 '15 at 04:21
  • Taking a wild guess, I'd say there's an error somewhere. – Andy Clifton Nov 24 '15 at 04:24
  • Thank you Andy Clifton, I have tested the BIC with higher K; when K reaches 155 the BIC comes to its smallest value. Originally I had only tested K up to a maximum of 50. – pengchy Nov 24 '15 at 07:53
  • @pengchy: you may be interested in this answer about finding optimum values of k: http://stackoverflow.com/a/18047359/2514568 – Andy Clifton Nov 24 '15 at 10:17
  • Hi Andy Clifton, thank you for the information. It is very helpful. – pengchy Nov 24 '15 at 23:30
  • Also https://stats.stackexchange.com/questions/55147/compute-bic-clustering-criterion-to-validate-clusters-after-k-means/183097#183097 – Chris Aug 20 '21 at 15:34

To compute BIC, just add 0.5*k*d*log(n) (where k is the number of means, d is the length of a vector in your dataset, and n is the number of data points) to the standard k-means error function.

The standard k-means penalty is \sum_n (m_k(n)-x_n)^2, where m_k(n) is the mean associated with the nth data point. Up to constants, this penalty can be interpreted as a negative log probability, so BIC is perfectly valid.

BIC just adds an additional penalty term to the k-means error proportional to k.
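
As a rough sketch (the kmeansBIC name and variable names are mine, not from this answer), the recipe applied to a stats::kmeans fit could look like:

kmeansBIC <- function(fit) {
  k <- nrow(fit$centers)    # number of means (clusters)
  d <- ncol(fit$centers)    # dimension of each data point
  n <- length(fit$cluster)  # number of data points
  # k-means error (total within-cluster sum of squares) plus the
  # additional penalty proportional to k described above:
  fit$tot.withinss + 0.5 * k * d * log(n)
}

As the comments below point out, this treats the within-cluster sum of squares as a stand-in for the negative log-likelihood, dropping the constant terms a full Gaussian-mixture likelihood would include.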

user1149913
  • 4,463
  • 1
  • 23
  • 28
  • I don't think the k-means penalty `\sum_n (m_k(n) - x_n)^2` (or the negative of that) is the log-likelihood. The log-likelihood should have 3 more terms: `-n*log(K)`, `-0.5*n*d*log(2*pi)` and `-n*d*log(\sigma)`, where \sigma is the common std for all Gaussians. Also, the "k" in the BIC formula is not the number of clusters, it is the number of free parameters in the mixture Gaussian model, so k should be: `k = K-1 + 1 + K*d = K*(d+1)`. Here I am using lower k for the BIC parameter term, and capital K as the number of means/clusters. – Jason Jun 14 '15 at 14:39
  • `K-1` is the number of free weights for the Gaussians: since the weights add up to 1, the dof is K-1. `1` is for the one common std for all Gaussians. And `K*d` is the number of parameters for the coordinates of the cluster centers. – Jason Jun 14 '15 at 14:40
  • Are there any weights in k-means? I think not.. So why do we need K-1? – user5054 Jun 30 '16 at 09:21

Just to add to what user1149913 said (I don't have enough reputation to comment), since you're using the kmeans function in R, \sum_n (m_k(n)-x_n)^2 is already calculated for you as KClData$tot.withinss.
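
For instance, combining this with user1149913's recipe (a sketch reusing the question's Data object and k = 2 clusters):

KClData <- kmeans(Data, centers = 2, nstart = 100)
# tot.withinss is the k-means error; add the 0.5*k*d*log(n) penalty:
BIC <- KClData$tot.withinss + 0.5 * 2 * ncol(Data) * log(nrow(Data))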

FrankD
  • 175
  • 5

Rather than reimplementing AIC or BIC, we can define a log-likelihood method for kmeans objects; this will then be used by the BIC function in the stats package.

logLik.kmeans <- function(object) structure(
  -object$tot.withinss/2,                           # treat -WSS/2 as the log-likelihood
  df = nrow(object$centers)*ncol(object$centers),   # free parameters: k centers in d dimensions
  nobs = length(object$cluster)                     # number of observations
)

Then to use it, call BIC as normal. For example:

example(kmeans, local=FALSE)
BIC(cl)
# [1] 26.22842084
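
Since stats::AIC dispatches on the same logLik method, the AIC is available the same way:

AIC(cl)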

This method will be provided in the next release of the stackoverflow package.

Neal Fultz

The function `qualityCriterion` in package `longitudinalData` can calculate BIC and AIC for a k-means clustering. It first calculates the likelihood of each cluster centre for every individual, then merges these together with weights given by cluster size. The sd of the normal density function is based on the RSS.

While the original code gives -BIC instead of BIC, I adapted the code to return BIC:

qualityCriterion <- function (traj, clusters) 
{
    if (nrow(traj) != length(clusters)) {
        stop("[qualityCriterion] the number of rows of traj must equal the length of clusters.")
    }

    # meanNA is a small helper from package longitudinalData; defined here
    # so this standalone version runs without the package:
    meanNA <- function(x) mean(x, na.rm = TRUE)

    clusters <- as.integer(clusters)
    nbIndiv <- nrow(traj)
    nbTime <- ncol(traj)
    nbClusters <- length(unique(clusters))

    # Cluster frequency
    preProba <- as.numeric(table(clusters))
    preProba <- preProba / sum(preProba)

    # Centers as cluster means
    moy <- matrix(, nbClusters, nbTime)
    for (i in 1:nbClusters) {
        moy[i, ] <- apply(traj[as.numeric(clusters) == i, , drop = FALSE], 2, meanNA)
    }
    
    # sd of residuals
    ecart <- sqrt(mean(as.numeric(traj - moy[clusters, ])^2, na.rm=TRUE))

    # likelihood
    vraisIndivXcluster <- matrix(, nbIndiv, nbClusters)
    for (i in 1:nbClusters) {
        vraisIndivXcluster[, i] <- preProba[i] * apply(dnorm(t(traj), moy[i, ], ecart), 
                                                       2, prod, na.rm = TRUE
                                                      )
    }
    vraisIndivXcluster <- apply(vraisIndivXcluster, 1, sum)
    logVraisemblance <- sum(log(vraisIndivXcluster))

    nbParam <- nbClusters * nbTime + 1 # cluster centers and sd
    #BIC <- -2 * logVraisemblance + nbParam * log(nbIndiv) # BIC for time series
    BIC2 <- -2 * logVraisemblance + nbParam * log(nbIndiv * nbTime) # BIC for independent columns
    AIC <- 2 * nbParam - 2 * logVraisemblance
    #AICc <- AIC + (2 * nbParam * (nbParam + 1)) / (nbIndiv - nbParam - 1) # AICc for time series
    #AICc2 <- AIC + (2 * nbParam * (nbParam + 1)) / (nbIndiv * nbTime - nbParam - 1) # AICc for independent columns

    return(list(criters = c(BIC2 = BIC2, AIC = AIC)))
}

An example with 100 individuals, each with 2 measurements, forming 3 clusters:

set.seed(1)
dat <- matrix(rnorm(100 * 2), nrow = 100, ncol = 2) # data of 100 individuals
dat[34:66,] <- dat[34:66,] + 4
dat[67:100,] <- dat[67:100,] + 8
plot(dat[,1], dat[,2]) # 3 cluster centers at (0,0), (4,4), (8,8)

[Scatter plot of the simulated data: three clusters centred at (0,0), (4,4), and (8,8)]

As expected, minimum BIC at k = 3 clusters:

# k-means with k = 2:5
clusters_2 <- kmeans(dat, centers = 2)
clusters_3 <- kmeans(dat, centers = 3)
clusters_4 <- kmeans(dat, centers = 4)
clusters_5 <- kmeans(dat, centers = 5)

BIC <- c(qualityCriterion(dat, rep(1, nrow(dat)))$criters["BIC2"],
         qualityCriterion(dat, clusters_2$cluster)$criters["BIC2"],
         qualityCriterion(dat, clusters_3$cluster)$criters["BIC2"],
         qualityCriterion(dat, clusters_4$cluster)$criters["BIC2"],
         qualityCriterion(dat, clusters_5$cluster)$criters["BIC2"]
        )

plot(1:5, BIC, xlab = "k", ylab = "BIC")
lines(1:5, BIC)

[Plot of BIC against k, with the minimum at k = 3]

Ryan SY Kwan