
I am trying to understand what stats::kmeans does differently from the simple version explained e.g. on Wikipedia. I am honestly so supremely clueless.

Reading the help on kmeans I learned that the default algorithm is Hartigan–Wong, not the more basic Lloyd method, so there should be a difference; but playing around with some normally distributed variables I couldn't find a case where they differed substantially and predictably.
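One way to poke at this is to request the two algorithms explicitly via the `algorithm` argument and compare; this is just a sketch with data I made up, not anything from the docs:

```r
set.seed(1)
## two well-separated Gaussian blobs, 50 points each
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit_hw <- kmeans(x, centers = 2, algorithm = "Hartigan-Wong")  # the default
fit_ll <- kmeans(x, centers = 2, algorithm = "Lloyd")
## with blobs this clean, both will typically land on the same partition,
## which would explain seeing no substantial difference
c(fit_hw$tot.withinss, fit_ll$tot.withinss)
```

The differences only tend to show up with awkward starting centres or overlapping clusters, not on tidy data like this.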

For reference, this is my utterly horrible code that I tested it against:

## square of the Euclidean metric
my_metric <- function(x=vector(),y=vector()) {
  stopifnot(length(x)==length(y))
  sum((x-y)^2)
}

## data: xy data
## k: number of clusters
my_kmeans <- function(data, k, maxIt=10) {

  ##get length and check if data lengths are equal and if enough data is provided
  l<-length(data[,1])
  stopifnot(l==length(data[,2]))
  stopifnot(l>k)

  ## generate the starting points
  ms <- data[sample(1:l,k),]

  ##append a group column g to the data and initialize last
  data$g<-0
  last <- data$g

  it<-0
  repeat{
    it<-it+1
    ##iterate through each data point and assign to cluster
    for(i in 1:l){
      distances <- rep(Inf, k)  ## one slot per cluster, not hardcoded to 3
      for(j in 1:k){
        distances[j]<-my_metric(data[i,c(1,2)],ms[j,])
      }
      data$g[i] <- which.min(distances)

    }

    ##update cluster centres (skip a cluster if it lost all its points,
    ##otherwise the mean of an empty set would produce NaN centres)
    for(i in 1:k){
      points_in_cluster <- data[data$g==i,1:2]
      if(nrow(points_in_cluster) > 0){
        ms[i,] <- c(mean(points_in_cluster[,1]),mean(points_in_cluster[,2]))
      }
    }

    ##break condition: assignments unchanged, or iteration limit reached
    if(all(last == data$g) || it >= maxIt){
      break
    }
    last<-data$g
  }

  data
}
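For what it's worth, the way I exercised it was against stats::kmeans on two far-apart blobs (the fixture below is my own, nothing canonical):

```r
set.seed(42)
## two blobs, 30 points each, centred at 0 and 5 with sd 0.5
df <- data.frame(x = rnorm(60, rep(c(0, 5), each = 30), 0.5),
                 y = rnorm(60, rep(c(0, 5), each = 30), 0.5))
## reference partition from the built-in implementation
ref <- kmeans(df, centers = 2, nstart = 10)
## with roughly 10-sigma separation, my_kmeans(df, 2) should agree with
## ref$cluster up to a relabelling of the two groups
sort(ref$size)
```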

Chalky

1 Answer


First off, this was a duplicate (as I just found out) of this post. But I will still try to give an example: when the clusters are well separated, Lloyd tends to leave each centre inside the cluster it starts in. So if two centres start in the same cluster, that cluster ends up split in two, while some other clusters get lumped together under a single centre.
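A sketch of the kind of setup I mean: three tight, well-separated clusters, with two starting centres planted in the first cluster and one in the second (the exact numbers are just my illustration):

```r
set.seed(7)
## three tight clusters centred at 0, 6 and 12 on both axes, 20 points each
x <- rbind(matrix(rnorm(40, 0,  0.3), ncol = 2),
           matrix(rnorm(40, 6,  0.3), ncol = 2),
           matrix(rnorm(40, 12, 0.3), ncol = 2))
## deliberately bad starts: two centres in cluster 1, one in cluster 2
starts <- x[c(1, 2, 21), ]
fit_ll <- kmeans(x, centers = starts, algorithm = "Lloyd", iter.max = 50)
## Lloyd keeps both centres inside cluster 1 (splitting it) and leaves the
## third centre to absorb clusters 2 and 3 together, so the total
## within-cluster sum of squares stays large
fit_ll$tot.withinss
```

Hartigan–Wong's point-by-point transfers can pull a centre out of an over-split cluster, so from the same starts it has a better chance of finding the three true clusters — though it is still a local search and not guaranteed to.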
