I want to cluster data points into three groups with k-means. I know the center of one of these three groups, but not of the other two. Therefore, I would like to pre-set the center of that one group and have the algorithm cluster accordingly, keeping that center fixed. However, I am not sure whether and how I can do this with the kmeans function in R.

If I do the clustering without pre-setting the center, the center of the group I know about gets shifted toward the other clusters' centers, which likely leads to false classification.

Thank you for any input.

Juliane

camille
jtz
  • Possible duplicate of [Refitting clusters around fixed centroids](https://stackoverflow.com/questions/33399000/refitting-clusters-around-fixed-centroids) – user3471881 Sep 14 '19 at 08:28
  • Can you add a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – camille Sep 14 '19 at 19:47

1 Answer

Sure, we can make our own initialization routine. For example, we can modify the Forgy method like this:

# modified Forgy
set.seed(1)

c1 <- c(7.8, 4.3, 6.8, 2.4)
cn <- rbind(c1, iris[sample(nrow(iris), 2),-5])

kmeans(iris[,-5], cn)$centers
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     6.684427    2.626896     6.512092  2.09042298
# 2     5.078494    3.646351     1.485264  0.05223007
# 3     6.012102    2.553765     3.869828  1.66717281

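The first initial centre is fixed, while the rest are selected randomly from the rows of the data set. Note that this fixes only the starting position: kmeans() still updates every centre during its iterations, so the first centre can move, as the output above shows.
Of course, supplying explicit centres makes the nstart argument inapplicable, but we can replicate that functionality easily by repeating the calculation many times and then picking the result with the highest between-cluster sum of squares (betweenss):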

# modified Forgy with nstart
set.seed(1)
data(iris)
m <- iris[,-5]

# initializing with the actual centroid of the first species
c1 <- colMeans(m[as.integer(iris[,5]) == 1,])
c1
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#        5.006        3.428        1.462        0.246 

kf <- function(x, clust, nc) {
    cn <- rbind(clust, x[sample(nrow(x), nc-1),])
    kmeans(x, cn)
}

l <- replicate(100, kf(m, c1, 3), simplify=FALSE)
bss <- sapply(l, '[[', "betweenss")
table(signif(bss, 4))
# 
# 538.6 602.5 
#    37    63 
kmo <- l[[which.max(bss)]]

kmo$centers
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     5.006000    3.428000     1.462000    0.246000
# 2     5.901613    2.748387     4.393548    1.433871
# 3     6.850000    3.073684     5.742105    2.071053
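
If the known centre has to stay exactly where it is through every iteration, rather than only serving as a starting point, one option is a small hand-written Lloyd-style loop that reassigns points as usual but recomputes only the free centres. The following is a minimal sketch under that assumption; fixed_kmeans and its arguments are hypothetical names, not part of the stats package, and the loop is a plain Lloyd iteration rather than the Hartigan-Wong algorithm that kmeans() uses by default.

```r
# Hypothetical helper: Lloyd-style k-means in which the first centre never moves.
# x: numeric data, fixed: the known centre, k: total number of clusters (k >= 2).
fixed_kmeans <- function(x, fixed, k, iter.max = 100) {
    stopifnot(k >= 2)
    x <- as.matrix(x)
    # start from the fixed centre plus k-1 randomly chosen rows (Forgy-style)
    centers <- rbind(fixed, x[sample(nrow(x), k - 1), , drop = FALSE])
    rownames(centers) <- NULL
    cluster <- integer(nrow(x))
    for (i in seq_len(iter.max)) {
        # squared Euclidean distance from every point to every centre (n x k)
        d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
        new_cluster <- max.col(-d, ties.method = "first")
        # stop once the assignments no longer change
        if (identical(new_cluster, cluster)) break
        cluster <- new_cluster
        # update only the free centres; row 1 stays at `fixed`
        for (j in 2:k) {
            # a free centre that currently owns no points is left where it is
            if (any(cluster == j)) {
                centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
            }
        }
    }
    list(cluster = cluster, centers = centers)
}

# usage on the same data as above
set.seed(1)
m <- iris[, -5]
c1 <- colMeans(m[iris$Species == "setosa", ])
fit <- fixed_kmeans(m, c1, 3)
fit$centers[1, ]   # equal to c1 by construction
```

As with the modified Forgy approach, the result depends on the random starting rows, so it may be worth running the sketch several times and comparing the solutions.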
AkselA
  • Thank you very much for this solution. I could apply it to my data, starting with a center of 0/0 for one of the groups. – jtz Jan 03 '20 at 11:24